Nau mai, haere mai. Welcome to BIOSCI 738

Lecturer: Dr. C. M. Jones-Todd

With thanks to Prof. Chris Triggs, Dr Kathy Ruggiero, and Prof. James Russell who all freely gave me their previous iterations of the course and told me to do what I wanted with their notes. I did, and they may well find sections of this material hauntingly familiar! Credit also to all the other fantastic people who’ve shared materials or have made notes freely available online. I have endeavored to reference all materials, and more, where appropriate.

© 2024, Dr. C. M. Jones-Todd

Key Information

Overview

This is a postgraduate course geared towards students of biology, ecology, and environmental science. Building on a strong foundation in quantitative biology, fundamental statistical methods and basic R programming, you will learn an array of advanced biostatistical methods for data analysis. Topics covered include data wrangling, methods for the analysis of designed experiments, regression analysis, including mixed effect models, and the analysis of multivariate data, including advanced supervised and unsupervised learning techniques.

Prerequisites

BIOSCI738 is a postgraduate course and so is pitched at a level where some fundamentals of R and statistics are already assumed (e.g., topics covered in BIOSCI220). Topics such as linear regression and hypothesis testing are recapped in the material, but at a quicker pace than in an undergraduate course, as we then delve further into them. It is expected that you will do a fair amount of further reading and problem-solving in your own time; how much extra time this takes will depend on the baseline knowledge you bring to the course. We will use the programming language R throughout the course (through RStudio) and you are expected to be familiar with data import, manipulation, and visualization. If you are unfamiliar with R it is expected that you will prepare accordingly before the semester begins. If you find yourself struggling a lot with basic R tasks then you will likely have to spend considerably more time tackling the questions. The provided materials will definitely help you here, but you may need to recap and cover them in more detail than we do in the lectures/labs.

These tasks are designed for you to self-assess your R and statistics knowledge. If you fail to complete the coding task and score less than 70% on the quiz, you will likely struggle with the pace and content of the course. This does not preclude you from taking the course; however, you should expect to invest time, on top of what is expected for a postgraduate course, into ensuring that you understand the material. If you feel the need to brush up on your R skills, here are two resources I’d recommend giving a whirl: 1) RStudio Education and 2) R for Data Science. There are plenty of others out there, so if these don’t suit you then I’d strongly encourage you to try some others until you find something that suits your learning style.

Lectures

Lectures are weekly (weeks 2–12)
  • Mondays: 9–11 am (Sci Maths & Physics, Room B05), and
  • Fridays: 2–4 pm (B201, Room 3429).

You are expected to attend all lectures and to bring along your own device (BYOD) with the appropriate software installed. Each 2-hour lecture will involve a mixture of group work and practical tasks that focus on building computational and inference skills.

Office hours

I hold weekly consultation hours during the semester:

  • Mondays & Tuesdays 12–1 pm, &
  • Fridays 1–2 pm.

These are held in my office in building 303, room 318. If you are unable to make these times just email me @ c.jonestodd@auckland.ac.nz and we can sort something else out.

Assessment summary

Overview of assessment timeline

Course policies

Academic honesty

TLDR: Don’t cheat!

You are expected to abide by the following as you work on assignments in this course:

  • All submitted work must be your own. Copying verbatim from sources (e.g., the internet, AI-generated content, or the course guide) without proper citation will be treated as plagiarism.
  • You may discuss assignments with other students; however, you may not directly share (or copy) code or write-ups with other students. Unauthorized sharing (or copying) of code or write-ups will be considered a violation for all students involved.
  • Unless explicitly stated otherwise you may make use of online resources (e.g., StackOverflow or Copilot etc.) for coding examples on assignments. If you directly use code from an outside source (or use it as inspiration), you must explicitly cite where you obtained the code. Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism.

Extensions

Note that extensions will only be granted in exceptional circumstances (e.g., illness with a doctor’s note). Last-minute requests due to poor time management, workload from other courses, or minor inconveniences will not be considered sufficient grounds for an extension. The goal is to keep everyone on track together; due dates for assignments are there to help you keep up with the course material and to ensure that I can provide feedback in a timely manner.

No single assessment in this course is worth more than 15% of your final grade. This is by design so that 1) no one assignment should feel overwhelming, and 2) you have multiple opportunities to demonstrate your learning in a variety of ways (see Assessment summary for a breakdown of assessment design). Because of this, deadlines for other courses or a busy schedule are not valid reasons for extensions. You’re given plenty of time to complete each assignment, so I encourage you to start early and plan ahead! If you’re facing a genuine issue that will prevent you from submitting your work it is imperative that you get in contact with me as soon as possible so that we might make alternative arrangements and figure out the best way to support you.

FAQs

What are some recommended resources to help me on my R learning journey? Some to start out with: 1) RStudio Education, 2) R for Data Science, and 3) Modern Statistics for Modern Biology.

How do I receive communications and updates on assignments for this course? All course communication will be via Canvas Announcements; I expect you to keep up to date with these!

Do I need to attend all lectures? Yes. I expect students to attend all lectures. You should only miss class in cases of emergencies or serious illness. Otherwise, I expect you to be there, just as you’d expect me to be!

Are the classes recorded? No, all lectures are in-person and will not be recorded.

What if I miss a lecture? The expectation is that you catch up on the materials and content yourself before the next session.

Do I need to bring a laptop to class? Yes, please bring along your own device (BYOD) with the appropriate software installed, as each lecture will likely involve practical tasks that focus on building computational and inference skills.

I don’t have my own laptop. No worries, please let me know and I can organise a loan laptop for you.

Can I use AI tools in this class? It depends! We will discuss this in more detail when lectures start. Please refer to the Academic honesty section with regards to submitted work.

I need an extension for an assignment, how do I do that? Please read the Extensions section.

When will I get my assignment grades? Feedback and solutions will be provided as soon as possible and uploaded via CANVAS. Be sure to review your feedback carefully and take some time to reattempt the tasks in line with the solutions as often there are multiple ways of doing the same thing!

What if I have a query about my assignment grade? Firstly, please read your feedback carefully and take some time to reattempt the tasks in line with the solutions. If you still have concerns, please email me @ c.jonestodd@auckland.ac.nz with a clear explanation as to what specific part of the grade or feedback you are querying.

How can I get help if I’m struggling with the content? Reach out! Ask me questions in class, attend my Office hours, or email me @ c.jonestodd@auckland.ac.nz. You are also welcome to drop by my office unannounced and check if I’m free, but please note I may be in a meeting or working from home some days.

Any other queries or concerns? Please contact me @ c.jonestodd@auckland.ac.nz

Support

The Te Papa Manaaki | Campus Care team provides a safe, confidential and free service that supports the health, well-being and safety of everyone at the University.

Several crisis helplines are available if you are worried about your safety or the safety of someone else, including the Mental Health Crisis Service (Phone: 0800 800 717). Rather text? Need to Talk? is available by free call or text on 1737.

There are many personal, academic and learning, financial (including emergency funds), and tech support services available to all students—learn more here.

Whai Ora Be Well offers a range of tools and information to help you care for your physical, emotional and spiritual wellbeing, so you can thrive.

Module 1

This course is not designed as a course in R. Throughout we use R and RStudio as tools to carry out statistical inference, but it is assumed that you are already familiar with the basics of this software. You will be led through multiple examples in the chapters to come, and it is expected that you invest time working through them, working out what each line of code does. It is inadvisable to simply copy & paste the examples without an idea of what is being asked. Figuring out the code for yourself will be a huge help when you come to modify it to suit your own purposes.

Rather than telling you how to carry out specific operations, this section of the course guide outlines practices and tools focused on good scientific practice. We discuss the best practices we should employ when dealing with data and code, and our obligations in drawing inferences from our analyses. There are many ongoing discussions in the scientific community around ethical data practice and what it entails. It is a hugely important subject, and in many ways has a long way to go. We only touch briefly on some aspects of ethical data practice in this section, presenting the suggestions and thoughts of the relevant communities and organisations in those areas. In particular, keep the following principles in mind:

  1. Accuracy and Honesty: Being accurate and honest in your analyses and conclusions.

  2. Respectful Handling of Data: Recognize that data may represent people, beliefs, or behaviour, and be respectful of this.

  3. Awareness of Consequences: Considering the ethical implications and societal impact of your research.

TASK Read Ethical Guidelines for Statistical Practice and outline which principle(s) you feel directly relate to your current career pathway.

R and RStudio

Throughout this course we will be programming using R. R is a free open-source statistical programming language created by Ross Ihaka (of Ngati Kahungunu, Rangitane and Ngati Pakeha descent) and Robert Gentleman, here at UoA in the Department of Statistics! It is widely used across many scientific fields for data analysis, visualization, and statistical modeling. Proficiency in R will enable you to wrangle and explore datasets, conduct statistical analyses, and create visualizations to communicate your findings. These are all essential tools required in any scientific discipline.

TASK Go to the Statistics Department display in the science building foyer (building 302) to see a CD-ROM with the first ever version of R.

RStudio is an integrated development environment (IDE) for the R programming language. It serves as a user-friendly workspace, offering additional tools for coding, data visualization, and project management. RStudio simplifies the process of learning R by providing a structured interface with many built-in tools to help organize workflow, fostering a systematic approach to data analysis and research.

TASK Research the meaning of open-source software and briefly outline the pros and cons of this in the context of statistical analysis.

Recap: R terminology

Term Description
Script A file containing a series of R commands and code that can be executed sequentially.
Source To execute the entire content of an R script, often done using the “Source” button in RStudio.
Running Code The process of executing R commands or scripts to perform specific tasks and obtain results.
Console The interactive interface in RStudio where R commands can be entered and executed line by line.
Commenting Adding comments to the code using the # symbol to provide explanations or annotations. Comments are ignored during code execution.
Assignment Operator The symbol <- or = used to assign values to variables in R.
Variable A named storage location for data in R, which can hold different types of values.
Data Type The classification of data into different types, such as numeric, character, logical, etc.
Object A data structure that holds a specific type of data. Objects are used to store and manipulate data, and they can take various forms depending on the type of information being represented.
Logical Operator Symbols like ==, !=, <, >, <=, and >= used for logical comparisons in conditional statements.
Function A block of reusable code that performs a specific task when called (e.g., mean(c(3, 4))).
Argument The input values that are passed to the function when it is called.
Error Handling The process of anticipating, detecting, and handling errors that may occur during code execution.
Debugging The process of identifying and fixing errors or bugs in the code.
Workspace The current working environment in R, which includes loaded data, variables, and functions.
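To ground a few of these terms, here is a minimal sketch (the object name and values are invented for illustration):

```r
# Assignment: `<-` stores the vector in the variable `weights`
weights <- c(3.2, 4.1)

# Function call: `mean` is the function, `x` is its (named) argument
mean(x = weights)
## [1] 3.65

# Commenting: everything after # is ignored when the code runs
```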

Which R syntax?

This is not a comprehensive list of R syntax, but a call to develop your own coding style and use whatever you are most comfortable with. Throughout this course I will likely use a mix of syntax and functions to carry out the same operations. This is (somewhat) by design and is intended to expose you to a strength of R: there are multiple ways of doing the same thing. As long as your code works and is reproducible, use whatever works for you, although of course there are some recommended practices; see Good coding practice.

Example <- vs =

Both <- and = are assignment operators, and which you use is mainly a personal choice. However, there is a difference depending on the environment in which the assignment is evaluated. At the top level there is no difference; however, = does not act as an assignment operator in a function call, whereas <- does.
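A quick sketch of this difference (the object names are arbitrary):

```r
x <- 1:3  # assignment
y = 1:3   # also assignment, because we are at the top level

# Inside a function call, `=` names an argument rather than creating an object:
mean(x = c(10, 20))   # here `x` is just the argument name; the object x above is untouched
## [1] 15

# whereas `<-` inside a call assigns in the calling environment:
mean(z <- c(10, 20))  # returns 15 *and* creates an object `z` as a side effect
## [1] 15
```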

TASK Both lines of code below give you a two-row matrix; can you work out the difference between them (HINT: look at what is created in your environment after each line)?

matrix("BIOSCI738", nrow = 2)
matrix("BIOSCI738", nrow <- 2)

Example %>% vs |>

Recall the tidyverse (specifically the magrittr) pipe operator %>%, which allows us to combine multiple operations in R into a single sequential chain of actions. There is also a base pipe |>, which acts similarly, but not identically. Essentially, |> passes the left-hand side as the first argument of the right-hand side call. This is subtly different from %>%, as |> cannot pass the left-hand side into multiple arguments. The %>% operator, however, does allow this, and you can also change the placement of the left-hand object with a . placeholder.
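A small illustration of the placeholder difference (toy numbers; the base pipe requires R ≥ 4.1, and %>% requires the magrittr package):

```r
suppressPackageStartupMessages(library(magrittr))  # provides %>%

c(1, 4, 10) %>% mean()  # LHS inserted as the first argument: mean(c(1, 4, 10))
## [1] 5
c(1, 4, 10) |> mean()   # the base pipe does the same here
## [1] 5

# With %>%, the . placeholder can move the LHS to any argument position:
3 %>% seq(1, .)  # becomes seq(1, 3)
## [1] 1 2 3

# The base pipe always inserts the LHS as the *first* argument:
3 |> seq(1)      # becomes seq(3, 1)
## [1] 3 2 1
```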

TASK Given df <- data.frame(x = 1:10, group = rep(c("A", "B"))) are all the following operations equivalent?

split(df, df$group)
df %>% split(.$group)
df |> split(df$group)

Error handling and debugging

Sometimes, rather than doing what you expect, R will return an Error message via the Console, prefaced with Error in… followed by text that tries to explain what went wrong. This generally means something has gone wrong, so what do you do?

  1. Read it! THE MESSAGES ARE WRITTEN IN PLAIN ENGLISH (MOSTLY)
  2. DO NOT continue running bits of code hoping the issue will go away. IT WILL NOT.
  3. Try and work out what it means and fix it (some tips below).

Interpreting R errors is an invaluable skill! The error messages are designed to be clear and informative; they aim to tell you what might be going wrong in the code. These messages typically contain information about the type of error, hinting at how you might fix it. For example, if there is a typo in a function or variable name, R will produce an error like "Error: object 'variable_name' not found." This suggests that R couldn’t find the specified object variable_name.

Understanding error messages involves paying attention to the details, such as the error type (e.g., syntax error, object not found, etc.) and the specific line number where the error occurred! You should always run your code line by line, especially if you are new to programming. This means that each line of code is executed in isolation, providing immediate feedback and pinpointing the exact location where an error occurs. If the meaning or solution of an error isn’t immediately obvious, then make use of the documentation (RTFM) and even technology. Online platforms like Stack Overflow or RStudio Community and tools like ChatGPT can often be your friend! It is very unlikely that you are the first to have encountered the error, and some kind soul will have posted a solution online (which the AI bot will have scraped…). However, it’s crucial to approach online answers carefully. You need to first understand the context of your error and use your knowledge of your issue to figure out whether the solution is applicable (this gets easier with experience).

It is always best to try debugging yourself before blindly following a stranger’s advice. Debugging is a crucial aspect of R programming and the process itself helps solidify your understanding. There are several widely used methods that help identify and resolve issues in your code, two are outlined below.

  • Print debugging involves actively printing out objects, asking the software to display variable values or check the flow of execution. For example, given the error message "Error: object 'variable_name' not found", I might ask R to print out which objects do exist in my workspace; it could be that a previous typo means the object variable_nam exists instead.
  • Rubber duck debugging is a useful debugging strategy where you explain your code or problem out loud (as if explaining it to a rubber duck). Honestly it works! The process helps clarify your thoughts and identify issues even before seeking external help.
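The print-debugging idea in miniature (the object names are hypothetical):

```r
variable_nam <- c(2, 4, 6)  # note the typo: this was meant to be variable_name

# mean(variable_name)       # Error: object 'variable_name' not found

ls()                        # print what *does* exist in the workspace -- reveals "variable_nam"
print(variable_nam)         # inspect its value, then fix the typo in the call
## [1] 2 4 6
```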

The table below summarises some common R issues and errors and their solutions.

Error/Issue Description Solution
Syntax Error Incorrect use of R syntax, such as missing or mismatched parentheses, braces, or quotes. Carefully review the code, check for missing or mismatched symbols, and ensure proper syntax is used.
Object Not Found Attempting to use a variable or function that hasn’t been defined or loaded into the workspace. Check for typos in variable or function names, ensure the object is created or loaded, and use correct names.
Package Not Installed/Loaded Trying to use a function from a package that is not installed or loaded into the environment. Install the required package using install.packages("package_name"), and load it using library(package_name).
Undefined Function/Variable Using a function or variable that hasn’t been defined or is out of scope. Define the function or variable, or check its scope within the code.
Data Type Mismatch Operating on data of incompatible types, such as performing arithmetic on character data. Ensure that data types are compatible, and consider using functions like as.numeric() or as.character() for conversions.
Misuse of Assignment Operator Using the wrong assignment operator (= instead of <- or vice versa). Be consistent with the assignment operator, and use <- for assignment in most cases.
Missing Data Dealing with missing values in the data, leading to errors in calculations or visualizations. Handle missing data appropriately, using functions like na.omit() or complete.cases().
Index Out of Bounds Attempting to access an element in a vector or data frame using an index that is out of range. Check the length of the vector or data frame and ensure the index is within bounds.
Failure to Load a File Issues with loading a data file using functions like read.csv() or read.table(). Check the file path, file format, and encoding. Confirm that the file exists in the specified location.
Incorrect Function Arguments Providing incorrect arguments or parameters to a function, leading to errors. Refer to the function’s documentation to understand the correct parameters and ensure proper usage.
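Two of the table’s entries sketched in code (the values are invented):

```r
# Data type mismatch: arithmetic on character data fails...
x <- c("1", "2", "3")
# x + 1                # Error: non-numeric argument to binary operator
as.numeric(x) + 1      # ...so convert first
## [1] 2 3 4

# Missing data: NA propagates through calculations...
y <- c(1, NA, 3)
mean(y)
## [1] NA
mean(y, na.rm = TRUE)  # ...unless handled explicitly
## [1] 2
```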

Sometimes your computer will return warning messages prefaced with “Warning:”. These can sometimes be ignored, as they may not affect us; however, READ THE MESSAGE and decide for yourself. Occasionally your computer will also write you a friendly message, just keeping you up to date with what it’s doing. Don’t ignore these either; they might be telling you something useful!
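For example, the following produces a warning rather than an error; the code still runs, but the message is telling you something important about the result:

```r
log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
```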

TASK Keep a bug diary! Each time you get an error message, see if you can solve it and keep a diary of your attempts and solution.

Accuracy and Honesty

Honesty is an expectation: we expect honesty from you, and you expect the same from your teaching team. Honesty, like accuracy, is an expectation in any scientific discipline. These are morals, ethical principles we should abide by. But this course isn’t here to discuss philosophy or character development. This course, in particular this section of the course, aims to expose you to the tools and principles that will aid you in your own pursuit of ethical data practice. Teaching you the tools so that your analysis is reproducible goes some way towards ensuring accuracy in your research. This is because reproducibility promotes transparency, facilitates error detection and correction, and contributes to the overall reliability and accuracy of your research findings.

Reproducible research

“Reproducibility, also known as replicability and repeatability, is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that results obtained by an experiment or an observational study or in a statistical analysis of a data set should be achieved again with a high degree of reliability when the study is replicated. … With a narrower scope, reproducibility has been introduced in computational sciences: Any results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.”
— Reproducibility, Wikipedia

Reproducibility is a stepping stone towards ensuring accuracy, and establishing good practice when dealing with data and code right from the beginning is essential. Good practice 1) ensures that data are collected, processed, and stored accurately and consistently, which helps maintain the quality and integrity of the data throughout their lifecycle; and 2) creates a robust code base that can be easily understood and adapted as the project progresses, which leads to faster development.

Good coding practice

You should always start with a clean workspace. Why? So your ex (code) can’t come and mess up your life!

To ensure that RStudio does not load up your previous workspace go to Tools > Global Options and uncheck the highlighted options below.

The reasoning may not be immediately obvious; however, it is something you will later regret if you don’t start as you mean to go on! Loading up a previous workspace may seem convenient, as your previous objects and code are immediately on hand. However, this is exactly why it is not good practice: loading up a previous workspace is NOT reproducible, does NOT create a fresh R process, and makes your script vulnerable. It will come back to bite you.

TASK Below are two quotes from Jenny Bryan, an R wizard, referencing two snippets of R code. Find out what each snippet does and why Jenny is so against them.

If the first line of your R script is setwd("C:\Users\jenny\path\that\only\I\have") I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
— Jenny Bryan, Tidyverse blog, workflow vs script
If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
— Jenny Bryan, Tidyverse blog, workflow vs script

A project-oriented workflow in R refers to a structured approach to organizing and managing your code, data, and analyses. This helps improve reproducibility and the overall efficiency of your work. Within this, it is essential to write code that is easy to understand, maintain, and share. To do so, coding best practice is to follow the 5 Cs by being

  1. Clear
    • Code Clarity: Write code that is easy to read and understand. Use meaningful variable and function names that convey the purpose of the code. Avoid overly complex or ambiguous expressions.
    • Comments: Include comments to explain the purpose of your code, especially for complex or non-intuitive sections. Comments should add value without stating the obvious.
  2. Concise:
    • Avoid Redundancy: Write code in a way that avoids unnecessary repetition. Reuse functions and use loops or vectorized operations when appropriate to reduce the length of your code.
    • Simplify Expressions: Simplify complex expressions and equations to improve readability. Break down complex tasks into smaller, manageable steps.
  3. Consistent:
    • Coding Style: Adhere to a consistent coding style throughout your project. Consistency in indentation, spacing, and naming conventions makes the code visually coherent.
    • Function Naming: Keep naming conventions consistent. If you use camelCase for variable names, continue to use it consistently across your codebase.
  4. Correct:
    • Error Handling: Implement proper error handling to ensure that your code gracefully handles unexpected situations. Check for potential issues, and provide informative error messages.
    • Testing: Test your code to ensure it produces the correct output. Use tools like unit tests (e.g., with testthat) to verify that your functions work as intended.
  5. Conformant:
    • Follow Best Practices: Adhere to best practices and coding standards in the R community. For example, follow the tidyverse style guide or the Google R Style Guide.
    • Package Guidelines: If you’re creating an R package, conform to package development guidelines. Use the usethis package to help set up your package structure in a conformant way.
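A before-and-after sketch of the Clear and Concise principles (the data frame and names here are invented):

```r
fish <- data.frame(sex = c("female", "male", "female"),
                   weight_kg = c(3.1, 4.0, 1.2))

# Unclear: cryptic names and a magic number
hf <- fish[fish$sex == "female" & fish$weight_kg > 2.5, ]

# Clearer: name the constant and comment the intent
min_weight_kg <- 2.5
# Keep only the female fish above the minimum weight
heavy_females <- fish[fish$sex == "female" & fish$weight_kg > min_weight_kg, ]
```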

TASK Read Good enough practices in scientific computing and briefly summarise why good coding practices are key to any scientific discipline.

There are many other good practice tips when it comes to coding; these include ensuring your code is modular, implementing unit testing, automating workflows, and implementing version control. In this course you will be using git to manage your project code and data. Use of git, or similar, will very likely be an expectation in your future career; see Version control with git and GitHub for an introduction to these tools.
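As a minimal taste of unit testing (assuming the testthat package is installed; the function is a made-up example):

```r
library(testthat)

# A small, modular function...
to_celsius <- function(fahrenheit) (fahrenheit - 32) * 5 / 9

# ...and a unit test verifying it behaves as intended
test_that("to_celsius converts known values", {
  expect_equal(to_celsius(32), 0)
  expect_equal(to_celsius(212), 100)
})
```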

File naming policies

Be nice to your computer:
  • No white spaces, as some systems can be confused by them (ensures compatibility across different systems and prevents errors).
  • No special characters (e.g., *, ^, +, …), as these can be misinterpreted or conflict with system functions.
  • Don’t assume case is meaningful (avoids confusion on systems that do not differentiate by case).
  • Consistency is KEY (avoids confusion in general).

Be nice to yourself and your collaborators:
  • Use concise and descriptive names (it just makes life easier :)).

Make sorting and searching easy:
  • Dates should follow the YYYY-MM-DD format (standardizes date representation, making sorting and interpretation easy).
  • Use numbers as a prefix to order files (enables sequential sorting of files regardless of system).
  • Left pad with 0 so numbers have the same length (ensures proper numerical order when sorting files).
  • Use keywords (you can search these!).
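Putting these rules together, a set of (hypothetical) file names for this course might look like:

```
2024-03-04_lab-01_data-wrangling.R    # sortable date, zero-padded number, searchable keywords
2024-03-11_lab-02_regression.R
notes final FINAL(2).R                # avoid: white spaces, special characters, no date
```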

TASK Work through Danielle Navarro’s presentation on Project Structure and see how many pitfalls you have fallen into to-date.

Version control with git and GitHub

Git is a version control system that manages the evolution of a set of files, called a repository (repo), in a structured way (think of Word’s Track Changes). With git you can track the changes you make to your project/code. You will always have a record of what you’ve worked on and can easily revert back to an older version if you make a mistake. GitHub is a hosting service that provides a home for your git-based projects on the internet (think of Dropbox). In addition, GitHub offers functionality to use git online via an easy-to-use interface. Both git and GitHub can very easily be configured to work with RStudio.

Below are some key terms you will undoubtedly hear when delving into the git–GitHub world.

Repository (already mentioned) This is where the work happens: think of it as your project folder. It should contain all of your project’s files.

Cloning A repository on GitHub is stored remotely in the cloud. To create a local copy of this repository you can clone it and use Git to sync the two.

Committing and pushing are how you can add the changes you made on your local machine to the remote repository in GitHub. You can make a commit when you have made milestone worthy changes to your project. You should also add a helpful commit message to remind future you, or your teammates, what the changes you made were (e.g., fixed the bug in my_function).

The table below gives Ten Simple Rules for Taking Advantage of Git and GitHub, which outlines how to get the most out of using these tools.

Title Why
Rule 1 Use GitHub to Track Your Projects Keeps your work backed up and in order
Rule 2 GitHub for Single Users, Teams, and Organizations Flexible repo rights helps project management
Rule 3 Developing and Collaborating on New Features: Branching and Forking Easily copy and create your own version of a project to modify
Rule 4 Naming Branches and Commits: Tags and Semantic Versions Consistency helps your collaborators and end-users
Rule 5 Let GitHub Do Some Tasks for You: Integrate Continuous integration helps make sure your code is ready to go as soon as possible
Rule 6 Let GitHub Do More Tasks for You: Automate Automating tasks means less manual work and more reliable testing
Rule 7 Use GitHub to Openly and Collaboratively Discuss, Address, and Close Issues Promotes collaboration
Rule 8 Make Your Code Easily Citable, and Cite Source Code! Just overall good practice and aids reproducibility
Rule 9 Promote and Discuss Your Projects: Web Page and More Promoting your work is likely always good for your career
Rule 10 Use GitHub to Be Social: Follow and Watch GitHub is a nice way to follow developments in your field and see others work

TASK Read Ten Simple Rules for Taking Advantage of Git and GitHub and briefly expand on the points above using examples from your studies/careers to-date.

Setting up

  1. Register an account with GitHub https://github.com. Choose the free option!
  2. Make sure you’ve got the latest version of R:

     R.version.string
     ## [1] "R version 4.4.2 (2024-10-31)"

  3. Upgrade RStudio to the new preview version (optional).
  4. Install git: follow these instructions.
  5. Get started!

Cloning a repository from GitHub using RStudio

  1. In GitHub, navigate to the Code tab of the repository and on the right side of the screen, click Clone or download.
  2. Click the Copy to clipboard icon to the right of the repository URL

  3. Open RStudio in your local environment
  4. Click File, New Project, Version Control, Git

  5. Paste the repository URL and enter TAB to move to the Project directory name field.

  6. Click Create Project. Your Files pane should now look similar to this

Committing and pushing changes

  1. Open a file from your project directory (here I’ve opened the file README.md). Note that the Git pane (top right) is empty.

  2. Make a change to your file and save it. Now note that the Git pane (top right) is no longer empty:

  3. Check this file in the Git tab (it is now staged for commit).

  4. Click the Commit button. A new pane will open. Changes made to the file will be highlighted (additions in green and deletions in red). Now write yourself an informative message in the top right of this pop-up:

  5. Click the Commit button below the message you’ve just written. A new pop-up will let you know how things are going! You can then close both pop-ups.

  6. Now you’ll see RStudio has left you a little message in the Git tab, something similar to Your branch is ahead of origin/master by 1 commit. This means that you’ve made and committed your changes locally (i.e., on your computer) but you are yet to push these changes to GitHub.

  7. To push to GitHub press the Push button.

  8. A new pop-up will let you know how things are going! You can then close this once it gives you the option to.

System specific hurdles

As with any new software there are likely to be a few teething issues! Below are some previous classmates’ solutions to issues encountered when committing and pushing changes using git via RStudio. These initial issues are likely due to authentication.

Windows config file

Assuming your current working directory is at the top level of your repo/project, choose the Files tab in your plotting pane.

  1. Choose the More tab and the Show Hidden Files option:

  2. Navigate to the .git folder and open the file named config:

  3. Add the following to config, replacing the relevant bits with your GitHub username/handle and email:
[user]
    name = <your GitHub username here>
    email = <your GitHub email here>
  4. Save the changes to this file and you should be all good to commit!

Personal Access Token authentication (linking GitHub to your project)

Using your GitHub email and password for authentication was decommissioned a while ago. Instead, a Personal Access Token is required.

  1. Sign in to GitHub and follow these instructions to create a Personal Access Token

  2. Once completed, a new pop-up will appear with what looks like a random jumble of letters and numbers (in the format ghp_*************); this is your Personal Access Token. Copy this, or put it in your notes or password manager for later

  3. Return to RStudio and clone your desired repository (e.g., the one for the assignment):

  • Create a New Project
  • Click on Version Control
  • Click on Git
  • Enter the URL from the repository link from Canvas
  • Make a name for the project (or just use the default)
  • Click Create

You will now be asked for your GitHub details:

  • When asked for your username type your GitHub account username/handle
  • When asked for your password, input the Personal Access Token you just created, not your GitHub account password

Note, depending on the choices you made when creating your Personal Access Token, it may expire (the default duration is 30 days) and your authentication will stop working. No worries, simply create a new one and use it as your password!

Respectful Handling of Data

Data sovereignty

The term data sovereignty has a hugely broad range of connotations. We do not aim to cover them all in this course. Typically, data sovereignty focuses on understanding and adhering to the legal and ethical considerations associated with data collection, storage, and analysis in different jurisdictions. Researchers are expected to follow and abide by best ethical and legal practice whilst respecting individuals’ privacy and rights.

“For Indigenous peoples, historical encounters with statistics have been fraught, and none more so than when involving official data produced as part of colonial attempts at statecraft.”
— Lovett, R., Lee, V., Kukutai, T., Cormack, D., Rainie, S.C. and Walker, J., 2019. Good data practices for Indigenous data sovereignty and governance. Good data, pp.26-36.

Indigenous data sovereignty refers to the right or interest indigenous peoples and nations have to govern the collection, ownership, and application of their own data. As we are in Aotearoa, this is most pertinent in terms of Māori data sovereignty.

Māori Data Sovereignty principles

“Māori Data Sovereignty has emerged as a critical policy issue as Aotearoa New Zealand develops world-leading linked administrative data resources.”
— Andrew Sporle, Maui Hudson, Kiri West. Chapter 5, Indigenous Data Sovereignty and Policy

Te Tiriti o Waitangi/Treaty of Waitangi obliges the Government to actively protect taonga, consult with Māori in respect of taonga, give effect to the principle of partnership and recognize Māori rangatiratanga over taonga.

TASK There are many ongoing discussions surrounding Te Tiriti o Waitangi/the Treaty of Waitangi, focused mainly on whether the principles it outlines have been upheld. What do you think? Can you find a recent event/news article that relates to your studies, which either clearly upholds, or not, what you deem to be the correct ethical practice here?

Māori Data Sovereignty principles inform the recognition of Māori rights and interests in data, and promote the ethical use of data to enhance Māori wellbeing.

The Te Mana o te Raraunga Model was developed to align Māori concepts with data rights and interests, and to guide agencies in the appropriate use of Māori data. Below are the guiding principles outlined by the Te Mana o te Raraunga Model in the Principles of Māori Data Sovereignty.

  • Rangatiratanga (authority)

    • Māori have an inherent right to exercise control over Māori data and Māori data ecosystems. This right includes, but is not limited to, the creation, collection, access, analysis, interpretation, management, security, dissemination, use and reuse of Māori data.
    • Decisions about the physical and virtual storage of Māori data shall enhance control for current and future generations. Whenever possible, Māori data shall be stored in Aotearoa New Zealand.
    • Māori have the right to data that is relevant and empowers sustainable self-determination and effective self-governance.
  • Whakapapa (relationships)

    • All data has a whakapapa (genealogy). Accurate metadata should, at minimum, provide information about the provenance of the data, the purpose(s) for its collection, the context of its collection, and the parties involved.
    • The ability to disaggregate Māori data increases its relevance for Māori communities and iwi. Māori data shall be collected and coded using categories that prioritise Māori needs and aspirations.
    • Current decision-making over data can have long-term consequences, good and bad, for future generations of Māori. A key goal of Māori data governance should be to protect against future harm.
  • Whanaungatanga (obligations)

    • Individuals’ rights (including privacy rights), risks and benefits in relation to data need to be balanced with those of the groups of which they are a part. In some contexts, collective Māori rights will prevail over those of individuals.
    • Individuals and organisations responsible for the creation, collection, analysis, management, access, security or dissemination of Māori data are accountable to the communities, groups and individuals from whom the data derive.
  • Kotahitanga (collective benefit)

    • Data ecosystems shall be designed and function in ways that enable Māori to derive individual and collective benefit.
    • Build capacity. Māori Data Sovereignty requires the development of a Māori workforce to enable the creation, collection, management, security, governance and application of data.
    • Connections between Māori and other Indigenous peoples shall be supported to enable the sharing of strategies, resources and ideas in relation to data, and the attainment of common goals.
  • Manaakitanga (reciprocity)

    • The collection, use and interpretation of data shall uphold the dignity of Māori communities, groups and individuals. Data analysis that stigmatises or blames Māori can result in collective and individual harm and should be actively avoided.
    • Free, prior and informed consent shall underpin the collection and use of all data from or about Māori. Less defined types of consent shall be balanced by stronger governance arrangements.
  • Kaitiakitanga (guardianship)

    • Māori data shall be stored and transferred in such a way that it enables and reinforces the capacity of Māori to exercise kaitiakitanga over Māori data.
    • Ethics. Tikanga, kawa (protocols) and mātauranga (knowledge) shall underpin the protection, access and use of Māori data.
    • Māori shall decide which Māori data shall be controlled (tapu) or open (noa) access.

Awareness of Consequences

Considering the implications and societal impact of your research includes ensuring that any conclusions you draw are appropriately and accurately balanced. Consider the previous chapter, Data Visualization: the guiding principles of making informed visualizations included not misleading readers and prioritizing conveying a clear message. These speak to being mindful of how your figures may be perceived and to presenting your data ethically and responsibly.

Your responsibilities go beyond just making figures; they extend to the methods you use and the inferences you draw. Learning how to communicate science is a key and invaluable skill. Siouxsie Wiles is an award-winning science communicator, perhaps best known for stepping up during the pandemic, giving us information about the virus and advice on how to beat it.

“I assumed it would be through my research by helping develop a new antibiotic. But through the pandemic, I’ve learned that I can have a huge impact globally by doing good science communication.”
— Associate Professor Siouxsie Wiles


Below is a case study in science (mis)communication.

Case study

Asthma carbon footprint ‘as big as eating meat’ is the headline of this news story published on the BBC website. The article is based on research outlined in this published paper, which in turn cites this paper for an estimate of the reduction in carbon footprint achieved by an individual not eating meat.

It is not unreasonable to assume that many people would interpret this to mean that the total global carbon footprint due to eating meat is equal to the total carbon footprint due to the use of asthma inhalers. However, this is not what they mean. They mean that an individual deciding not to eat meat reduces their carbon footprint as much as an asthmatic individual deciding not to use an inhaler.

There are far more meat consumers than inhaler users, and so the overall carbon footprint associated with meat consumption is much greater. However, the claim that not eating meat reduces someone’s carbon footprint by about the same amount as not using inhalers is questionable. Yet both the BBC article and the paper make this claim.

“And at the individual level, each metered-dose inhaler replaced by a dry powder inhaler could save the equivalent of between 150kg and 400kg (63 stone) of carbon dioxide a year - similar to the carbon footprint reduction of cutting meat from your diet.”
“Changing one MDI device to a DPI could save 150–400 kg CO2 annually; roughly equivalent to installing wall insulation at home, recycling or cutting out meat.”

Now, the carbon footprint of eating meat is estimated at 300–1600 kg CO2 annually by this paper (see Table 1), and so the two claims don’t really match up. Moreover, what is the article suggesting? That asthmatics should think about ceasing their medication in the same way many people are trying to reduce meat consumption?!

In this section we’ve discussed how ethical data practice involves accuracy, respect, and clear communication. There is one other component that should be considered here and that is consequence. The two options in this case study are not balanced because they have very different consequences:

  • Not eating meat is (possibly) good for you and is also good for the planet, but
  • Not taking your inhaler is (probably) much worse for your health.

TASK Watch this lecture Algorithmic fairness: Examples from predictive models for criminal justice and summarise the key points made. Can you think of a recent story that highlights the issues raised?

Data Visualization

The guiding principles of making informed visualizations included not misleading readers and prioritizing conveying a clear message. These speak to being mindful of how your figures may be perceived and presenting your data ethically and responsibly.

“Scientific visualization is classically defined as the process of graphically displaying scientific data. However, this process is far from direct or automatic. There are so many different ways to represent the same data: scatter plots, linear plots, bar plots, and pie charts, to name just a few. Furthermore, the same data, using the same type of plot, may be perceived very differently depending on who is looking at the figure. A more accurate definition for scientific visualization would be a graphical interface between people and data.”
— Nicolas P. Rougier, Michael Droettboom, Philip E. Bourne, Ten Simple Rules for Better Figures

Exploratory plots (for your own purposes)

Exploratory plots are just for you; they focus solely on data exploration. They:

  1. Don’t Have to Look Pretty: These plots only need to reveal insights.
  2. Just Need to Get to the Point: Keep the plots concise and straightforward. Avoid unnecessary embellishments or complex formatting.
  3. Explore and Discover New Data Facets: Use exploratory plots to uncover hidden patterns, trends, or outliers in the data. Employ different plot types to reveal various facets and aspects of the dataset.
  4. Help Formulate New Questions: Use exploratory plots as a tool to prompt new questions and hypotheses. As you discover patterns, think about what additional questions these findings raise for further investigation.

Explanatory plots

“…have obligations in that we have a great deal of power over how people ultimately make use of data, both in the patterns they see and the conclusions they draw.”
— Michael Correll, Ethical Dimensions of Visualization Research

Explanatory plots are mainly for others. These are the most common kind of graph used in scientific publications. They should

  1. Have a Clear Purpose: Define a clear and specific purpose for your plot. Identify what scientific question or hypothesis the plot is addressing. Avoid unnecessary elements that do not contribute to this purpose.
  2. Be Designed for the Audience: Tailor the design of your plot to the characteristics and expertise of your audience. Consider their familiarity with technical terms, preferred visualizations, and overall scientific background.
  3. Be Easy to Read: Prioritize readability by using legible fonts, appropriate font sizes, and clear labels. Ensure that the axes are well-labeled, and use a simple and straightforward layout. Avoid clutter and unnecessary complexity that could hinder comprehension.
  4. Not Distort the Data: Maintain the integrity of the data by avoiding distortion in your plot. Ensure that scales and proportions accurately represent the underlying data, preventing misleading visualizations.
  5. Help Guide the Reader to a Particular Conclusion: Structure your plot in a way that guides the reader toward the intended conclusion. Use visual elements such as annotations, arrows, or emphasis to highlight key findings and lead the reader through the data interpretation process.
  6. Answer a Specific Question: Construct your plot with a specific research question in mind. The plot should directly address and answer this question, providing a clear and unambiguous response.
  7. Support an Outlined Decision: If the plot is intended to support decision-making, clearly outline the decision or action it is meant to inform. The plot should provide relevant information that aids in making well-informed decisions based on the presented data.

The table below summarises Ten Simple Rules for Better Figures, a basic set of rules to improve your visualizations.

Rule Name: Rule Description
Know Your Audience: Understand the characteristics and expertise of your audience to tailor the figure accordingly.
Identify Your Message: Clearly define the main message or takeaway that you want the audience to understand from the figure.
Adapt the Figure to the Support Medium: Tailor the figure’s complexity and design to suit the medium it will be presented in (e.g., print, online).
Captions Are Not Optional: Craft informative captions that provide essential details and context for interpreting the figure.
Do Not Trust the Defaults: Adjust default settings to optimize the visual elements of the figure, such as colors, scales, and labels.
Use Color Effectively: Apply color purposefully, taking into account accessibility considerations and cultural interpretations.
Do Not Mislead the Reader: Avoid creating misleading visualizations; there are even formulas to measure how misleading a graph is!
Avoid Chartjunk: Eliminate unnecessary decorations and embellishments in the figure that do not contribute to the message.
Message Trumps Beauty: Prioritize conveying a clear message over making the figure aesthetically pleasing.
Get the Right Tool: Choose the appropriate visualization tool (e.g., R) or chart type that best communicates the data and intended message.

TASK Read this short paper and find examples in your choice of literature of one or more of the rules in action.

TASK Put your data wrangling and visualization skills to the test and find the hidden picture in this dataset.

Module 2

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. The word inference means conclusions or decisions. Statistical inference is about drawing conclusions and making decisions based on observed data.

Data, or observations, typically arise from some underlying process. It is the underlying process we are interested in, not the observations themselves. Sometimes we call the underlying process the population or mechanism of interest. The data are only a sample from this population or mechanism. We cannot possibly observe every outcome of the process, so we have to make do with the sample that we have observed. The data give us imperfect insight into the population of interest.

The role of statistical inference is to use this imperfect data to draw conclusions about the population of interest, while simultaneously giving an honest reflection of the uncertainty in our conclusions. Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.

TASK Imagine that you have a fair coin and toss it 10 times. Write out all the possible outcomes alongside how likely you think each is.
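Once you have written down your guesses, you can check them against the binomial distribution. A minimal sketch in R: the number of heads in 10 independent tosses of a fair coin follows a Binomial(10, 0.5) distribution, and dbinom() gives the probability of each possible count.

```r
## Probabilities of each possible number of heads in 10 tosses of a fair coin
heads <- 0:10
probs <- dbinom(heads, size = 10, prob = 0.5)
round(setNames(probs, heads), 4)
## 5 heads is the most likely single outcome, but still occurs less than 25% of the time
```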

R packages and datasets

It is assumed in all following examples of this module that the following code has been executed successfully.

packages used in this module

library(tidyverse) ## general functionality
library(MASS) ## geyser data
library(palmerpenguins) ## penguins data
library(gglm) ## nice lm() diagnostic plots
library(ggeffects) ## marginal predictions

datasets used in this module

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
  1. jackal
## Mandible lengths (mm) for golden jackals (Canis aureus) of each sex from the British Museum
jackal <- data.frame(mandible_length_mm = c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112,
                                            110, 111, 107, 108, 110, 105, 107, 106, 111, 111),
                     sex = rep(c("Male","Female"), each = 10))
  2. paua
paua <- read_csv(paste(base_url, "paua.csv", sep = ""))
  3. MASS::geyser
data(geyser, package = "MASS")
  4. rats
rats <- read_csv(paste(base_url, "crd_rats_data.csv", sep = ""))
rats$Surgery <- as_factor(rats$Surgery)
  5. palmerpenguins::penguins
data(penguins, package = "palmerpenguins")

Permutation and randomisation tests

A permutation test is a statistical significance test where the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points. This is also known as an exact test.

Now, often finding all possible combinations is hugely computationally expensive so we harness the power of simulation and carry out a randomisation test, which randomly simulates (for a given number of repetitions) possible values of the test statistic under the null hypothesis to obtain its approximate distribution.
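To get a feel for how quickly "all possible rearrangements" blows up, consider splitting 2n observations into two groups of n; the counts below are exact binomial coefficients.

```r
## Number of ways to split 2n observations into two groups of size n
choose(20, 10)  ## two groups of 10: 184756 rearrangements, still feasible
choose(60, 30)  ## two groups of 30: about 1.2e17 rearrangements, not feasible
```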

The basic approach to a randomisation test is straightforward:

  1. Choose a statistic to measure the effect in question (e.g., differences between group means).
  2. Calculate that test statistic on the observed data. Note this statistic can be anything you wish.
  3. Construct the sampling distribution that this statistic would have if the effect were not present in the population (i.e., the distribution under the null hypothesis, \(H_0\)). For a chosen number of repetitions:
    • shuffle the data labels
    • calculate the test statistic for the reshuffled data and retain it
  4. Find the location of your observed statistic in the sampling distribution. This location is informative:
    • if it lies in the main body of the distribution then the observed statistic could easily have occurred by chance;
    • if it lies in the tail of the distribution then the observed statistic would rarely occur by chance, so there is evidence that something other than chance is operating.
  5. Calculate the proportion of times your reshuffled statistics equal or exceed the observed one. This p-value is the probability of observing a statistic at least as “extreme” as the one we observed. State the strength of evidence against the null hypothesis on the basis of this probability.

TASK Study this cheatsheet and link the relevant sections to each step given above.

A permutation test: Jackal mandible lengths

The jackal data contain mandible lengths (mm) for golden jackals (Canis aureus) of each sex; these data were collected from the British Museum’s archives.

Scientific question: Are the jaw lengths of jackals the same, on average, in both sexes?

Null hypothesis: The average jaw length in male jackals is the same as for females

Test statistic: Difference of sample means

Let us first calculate the observed test statistic:

## observed statistic
jackal_mean_diff <- jackal %>%
  group_by(sex) %>%
  summarise(mean = mean(mandible_length_mm)) %>% 
  summarise(diff = diff(mean)) %>%
  as.numeric()
jackal_mean_diff
## [1] 4.8

Now we use the command combn(x, m) to generate all possible combinations of the elements of x taken m at a time. Our data have 20 elements in total, in two groups of 10, so we use combn(20, 10). From these combinations we can calculate the proportion of times the test statistic computed under the null hypothesis is at least as extreme as the one observed (the p-value):

combinations <- combn(20,10)
## Do the permutations
permtest_combinations <- apply(combinations, 2, function(x)
  mean(jackal$mandible_length_mm[x]) - mean(jackal$mandible_length_mm[-x]))
## Full Permutation test p.value
p_val <- length(permtest_combinations[abs(permtest_combinations) >= jackal_mean_diff]) / choose(20,10)
p_val
## [1] 0.003334127

Rather than considering all possible combinations we might instead use random permutations under the null hypothesis: below we draw 99 random shuffles and include the observed statistic, giving 100 values in total. Here, we sample without replacement:

## set up matrix
random_perm <- apply(matrix(0, nrow = 99, ncol = 1), 1, function(x) sample(20))
random_mean_diff <- apply(random_perm, 2, function(x){
  z <- jackal$mandible_length_mm[x]
  mean(z[jackal$sex == "Male"]) - mean(z[jackal$sex == "Female"])
})
## add the observed 
random_mean_diff <- c(random_mean_diff, jackal_mean_diff)
random_p.value <- length(random_mean_diff[abs(random_mean_diff) >= jackal_mean_diff]) / 100 ## note the abs()
random_p.value
## [1] 0.01

How does this compare to the exact (permutation) result above? Try increasing the number of times you randomise. What happens as you increase it?
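As a sketch of what happens as the number of randomisations grows (re-creating the jackal data so the chunk stands alone; the seed is only for reproducibility), the approximate p-value settles down towards the exact value of about 0.0033:

```r
## Self-contained re-creation of the jackal data
jackal <- data.frame(
  mandible_length_mm = c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112,
                         110, 111, 107, 108, 110, 105, 107, 106, 111, 111),
  sex = rep(c("Male", "Female"), each = 10))
obs_diff <- mean(jackal$mandible_length_mm[jackal$sex == "Male"]) -
  mean(jackal$mandible_length_mm[jackal$sex == "Female"])  ## 4.8, as before
set.seed(738)  ## reproducibility only
nreps <- 9999
perm_diffs <- replicate(nreps, {
  shuffled <- sample(jackal$sex)  ## shuffle the sex labels
  mean(jackal$mandible_length_mm[shuffled == "Male"]) -
    mean(jackal$mandible_length_mm[shuffled == "Female"])
})
## proportion at least as extreme as observed (observed value included in the count)
sum(abs(c(perm_diffs, obs_diff)) >= abs(obs_diff)) / (nreps + 1)
```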

A randomisation test: Pāua shell length

The paua dataset contains the following variables

  • Age of Pāua in years (calculated from counting rings in the cone)
  • Length of Pāua shell in centimeters
  • Species of Pāua: Haliotis iris (typically found in NZ) and Haliotis australis (less commonly found in NZ)
glimpse(paua)
## Rows: 60
## Columns: 3
## $ Species <chr> "Haliotis iris", "Haliotis australis", "Haliotis australis", "…
## $ Length  <dbl> 1.80, 5.40, 4.80, 5.75, 5.65, 2.80, 5.90, 3.75, 7.20, 4.25, 6.…
## $ Age     <dbl> 1.497884, 11.877010, 5.416991, 4.497799, 5.500789, 2.500972, 6…

One question we may want to ask is whether, on average, the shell length differs between species.

Scientific question: Are the shell lengths the same, on average, in both species?

Null hypothesis: The distribution of shell lengths in Haliotis iris is the same as in Haliotis australis

Test statistic: Difference of sample means

But because the data are skewed and we’ve likely got non-constant variances, we may be better off adopting a randomisation test (this time using a for loop!) rather than a parametric t-test

1. Choose a statistic that measures the effect of interest (in this case the differences between means).

## observed differences in means
diff_in_means <- (paua %>% group_by(Species) %>%
                    summarise(mean = mean(Length)) %>% 
                    summarise(diff = diff(mean)))$diff
diff_in_means
## [1] -0.9569444

2. Construct the sampling distribution that this statistic would have if the effect were not present in the population (i.e., the distribution under \(H_0\)).

## Number of times I want to randomise
nreps <- 999   
## initialize empty array to hold results
randomisation_difference_mean <- numeric(nreps)
set.seed(1234) ## *****Remove this line for actual analyses*****
## This means that each run will produce the same results and
## agree with the printout that I show.

for (i in 1:nreps) {
  ## the observations
  data <- data.frame(value = paua$Length)
  ##  randomise labels
  data$random_labels <- sample(paua$Species, replace = FALSE)
  ## randomised differences in mean
  randomisation_difference_mean[i] <- data %>% 
    group_by(random_labels) %>% summarise(mean = mean(value)) %>% 
    summarise(diff = diff(mean)) %>%
    as.numeric()
}
## results, join randomised stats with the observed value
results <- data.frame(results = c(randomisation_difference_mean, diff_in_means))

3. Locate the observed statistic (i.e., from our observed random sample) in the sampling distribution

Let’s calculate how many randomised differences in means are at least as extreme as the one we observed. Note that we use the absolute value as we are dealing with a two-tailed test.

n_exceed <- sum(abs(results$results) >= abs(diff_in_means))
n_exceed
## [1] 2
## proportion
n_exceed/nreps
## [1] 0.002002002

What do you conclude from this proportion? How does this tie in with the distribution of the test statistic under the null hypothesis shown below?

NOTE: We can extend the randomisation test to make inference about any sample statistic (not just the mean).

TASK Carry out a randomisation test in place of the permutation test carried out for the jackal data. What do you conclude?

Bootstrap resampling

A sampling distribution shows us what would happen if we took very many samples under the same conditions. The bootstrap is a procedure for finding the (approximate) sampling distribution from just one sample.

The original sample represents the distribution of the population from which it was drawn. Resamples, taken with replacement from the original sample, are representative of what we would get from drawing many samples from the population (the distribution of a statistic calculated from each resample is known as the bootstrap distribution of that statistic). The bootstrap distribution of a statistic represents that statistic’s sampling distribution.

TASK Study this cheatsheet and link the relevant sections to each step given above.

Example: constructing bootstrap confidence intervals

Old Faithful is a geyser located in Yellowstone National Park, Wyoming. Below is a histogram of the duration (in minutes) of 299 consecutive eruptions. Clearly the distribution is bimodal (has two modes)!

ggplot(data = geyser, aes(x = duration)) +
  geom_histogram() +
  xlab("Duration of eruptions (m)") +
  theme_linedraw()

Step 1: Calculating the observed mean eruption duration time:

mean <- geyser %>%
  summarise(mean = mean(duration))
mean
##       mean
## 1 3.460814

Step 2: Construct bootstrap distribution

## Number of times I want to bootstrap
nreps <- 1000   
## initialize empty array to hold results
bootstrap_means <- numeric(nreps)
set.seed(1234) ## *****Remove this line for actual analyses*****
## This means that each run will produce the same results and
## agree with the printout that I show.
for (i in 1:nreps) {
  ## bootstrap. note with replacement
  bootstrap_sample <- sample(geyser$duration, replace = TRUE)
  ##  bootstrapped mean resample
  bootstrap_means[i] <- mean(bootstrap_sample)
}

Step 3: Collate the sampling distribution

## results
results <- data.frame(bootstrap_means = bootstrap_means)
ggplot(data = results, aes(x = bootstrap_means)) +
  geom_histogram() +
  geom_vline(xintercept = as.numeric(mean)) +
  ggtitle("Sampling distribution of the mean") +
  xlab("Bootstrap means")  + ylab("") + theme_classic()

Step 4: Calculate quantities of interest from the sampling distribution

  1. The bootstrap estimate of bias is the difference between the mean of the bootstrap distribution and the value of the statistic in the original sample:
bias <- as.numeric(mean) - mean(results$bootstrap_means)
bias
## [1] 0.001200111
  2. The bootstrap standard error of a statistic is the standard deviation of its bootstrap distribution:
sd(results$bootstrap_means)
## [1] 0.06740607
## compare to SEM of original data
 MASS::geyser %>%
  summarise(sem = sd(duration)/sqrt(length(duration)))
##          sem
## 1 0.06638498
  3. A bootstrap \(t\) confidence interval. If, for a sample of size \(n\), the bootstrap distribution is approximately Normal and the estimate of bias is small, then an approximate level-\(C\) confidence interval for the parameter corresponding to the statistic is:

\[\text{statistic} \pm t^* \text{SE}_\text{bootstrap}\] where \(t^*\) is the critical value of the \(t_{n-1}\) distribution with area \(C\) between \(-t^*\) and \(t^*\).

For \(C = 0.95\):

as.numeric(mean) + c(-1,1) * qt(0.975,298)*sd(results$bootstrap_means)
## [1] 3.328162 3.593466

So our 95% confidence interval is 3.3 to 3.6.

  4. A bootstrap percentile confidence interval. Use the bootstrap distribution itself to determine the limits of the confidence interval by taking the limits of the sorted, central \(C\) bulk of the distribution. For \(C = 0.95\) with 1000 bootstrap samples:
sort(results$bootstrap_means)[c(25,975)]
## [1] 3.328428 3.591081
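Equivalently (a sketch, regenerating the bootstrap distribution so the chunk stands alone), quantile() picks out the central 95% limits without sorting by hand; it interpolates between sorted values rather than taking the 25th and 975th directly, so the limits differ very slightly.

```r
## Percentile interval via quantile(); MASS ships with R so geyser is available
data(geyser, package = "MASS")
set.seed(1234)  ## reproducibility only
boot_means <- replicate(1000, mean(sample(geyser$duration, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))  ## roughly 3.33 to 3.59
```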

Resampling procedures: the differences

Resampling methods are any of a variety of methods for doing one of the following:

  1. Estimating the precision of sample statistics (e.g., bootstrapping)
  2. Performing significance tests (e.g., permutation/exact/randomisation tests)
  3. Validating models (e.g., bootstrapping, cross validation)

Permutation vs bootstrap test

  • The permutation test exploits symmetry under the null hypothesis.

  • A full permutation test p-value is exact, conditional on data values in the combined sample.

  • A bootstrap estimates the probability mechanism that generated the samples under the null hypothesis.

  • A bootstrap does not rely on any special symmetry or assumption of exchangeability.
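As a concrete illustration of the exchangeability idea, here is a minimal sketch of a permutation test for a difference in two group means; the data (y, group) are simulated stand-ins for illustration only, not a dataset from this course:

```r
set.seed(1234)
## simulated stand-in data: two groups of 10 observations
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 6))
group <- rep(c("A", "B"), each = 10)

## observed difference in group means
observed <- diff(tapply(y, group, mean))

## under H0 the labels are exchangeable, so shuffle them many times
perm_diffs <- replicate(1000, {
  diff(tapply(y, sample(group), mean))
})

## two-sided p-value: proportion of permuted differences at least as extreme
mean(abs(perm_diffs) >= abs(observed))
```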

Parametric hypothesis testing

The vocabulary of hypothesis testing

Type I and Type II errors

“Type I Zoom error: you think people can hear you, but you’re actually on mute. Type II Zoom error: you think you’re muted, but actually people can hear you.”
@CynPeacock, Twitter

A Type I error (false positive): declare a difference (i.e., reject \(H_0\)) when there is no difference (i.e., \(H_0\) is true). The risk of a Type I error is determined by the level of significance, which we set (i.e., \(\alpha = \text{P(Type I error)} = \text{P(false positive)}\)).

A Type II error (false negative): difference not declared (i.e., \(H_0\) not rejected) when there is a difference (i.e., \(H_0\) is false). Let \(\beta =\) P(do not reject \(H_0\) when \(H_0\) is false); so, \(1-\beta\) = P(reject \(H_0\) when \(H_0\) is false) = P(a true positive), which is the statistical power of the test.

Each time we carry out a hypothesis test the probability we get a false positive result (type I error) is given by \(\alpha\) (the level of significance we choose). The significance level is the probability of a Type I error (i.e., the probability of finding an effect that is not there, a false positive). The power of a test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. The probability of finding an effect that is there = 1 - probability of a Type II error (false negative).

Reducing the chance of a Type I error increases the chance of a Type II error. They are inversely related. Type II error rate is determined by a combination of the following.

  • Effect size (size of difference, of biological significance) between the true population parameters
  • Experimental error variance
  • Sample size
  • Choice of Type I error rate (\(\alpha\))
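These trade-offs can be explored with R’s built-in power.t.test(); a quick sketch (the effect size and SD values here are arbitrary, for illustration only):

```r
## power to detect a difference of 0.5 (SD = 1) at the 5% significance level
power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power
## or solve for n: sample size per group needed for 80% power
power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)$n
```

Larger samples, larger effect sizes, smaller error variance, and a more generous \(\alpha\) all increase power (i.e., decrease the Type II error rate).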

When we have multiple comparisons to make we should then control the Type I error rate across the entire family of tests under consideration, i.e., control the Family-Wise Error Rate (FWER); this ensures that the risk of making at least one Type I error among the family of comparisons in the experiment is \(\alpha\). More on this in later chapters.
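To see why this matters: if each of \(m\) independent tests is carried out at level \(\alpha\), the chance of at least one false positive is \(1 - (1-\alpha)^m\), which grows quickly with \(m\). A quick sketch:

```r
## probability of at least one Type I error across m independent tests
alpha <- 0.05
m <- c(1, 5, 10, 20, 100)
1 - (1 - alpha)^m
## p.adjust() (e.g., method = "bonferroni") is one common way to control this
```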

State of Nature    Don’t reject \(H_0\)   Reject \(H_0\)
\(H_0\) is true    Correct decision       Type I error
\(H_0\) is false   Type II error          Correct decision

p-values

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”
— ASA Statement on p-Values

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

There are many different schools of thought about how a p-value should be interpreted.

Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against \(H_0\).

Some people go further and use an accept/reject framework. Under this framework, the null hypothesis \(H_0\) should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05. In this course we mostly use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by \(H_0\):

In summary, p-values can indicate how incompatible the data are with a specified statistical model; they do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Note that a p-value does not measure the size of an effect or the importance of a result, and by itself it does not provide a good measure of evidence regarding a model or hypothesis. Note also that substantial evidence of a difference does not equate to evidence of a substantial difference! Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold; proper inference requires full reporting and transparency.

Remember that statistical significance does not imply practical significance, and that statistical significance says nothing about the size of treatment differences. To estimate the sizes of differences you need confidence intervals, look out for these in the following chapters.

Some p-value threshold recursive FAQs

Question Answer
Why do so many colleges and grad schools teach p-val=0.05? Because that’s still what the scientific community and journal editors use. BUT IT SHOULDN’T BE
Why do so many people still use p-val=0.05? Because that’s what they were rote taught in college or grad school. BUT THEY SHOULDN’T BE
TASK Briefly outline why a threshold cutoff is problematic in the context of hypothesis testing.

Example

Recall the paua dataset, which contains the following variables

  • Age of Pāua in years (calculated from counting rings in the cone)
  • Length of Pāua shell in centimeters
  • Species of Pāua: Haliotis iris (typically found in NZ) and Haliotis australis (less commonly found in NZ)

A one-sample t-test

Using a violin plot we can look at the distribution of shell Length. We can calculate the average Length of all shells in our sample

paua %>% summarise(average_length = mean(Length))
## # A tibble: 1 × 1
##   average_length
##            <dbl>
## 1           5.19

What about drawing inference? Do we believe that the average length of Pāua shells is, say, 5cm? We know our sample average, but can we make any claims based on this one number?

How do we reflect our uncertainty about the population mean? Remember, it’s the population we want to make inference on based on our sample!

Enter the Standard Error of the Mean (SEM), \[\text{SEM} = \frac{\sigma}{\sqrt{n}}, \] where \(\sigma = \sqrt{\frac{\Sigma_{i = 1}^n(x_i - \bar{x})^2}{n-1}}\) (\(i = 1,...,n\)) is the standard deviation (SD) of the sample, \(n\) is the sample size, and \(\bar{x}\) is the sample mean.

Calculating \(\Sigma_{i = 1}^n(x_i - \bar{x})^2, i = 1,...,n\) by hand.

In relatively plain English, this sum is the sum of the squared distances between the \(i^{th}\) observation and the sample mean \(\bar{x}\) (denoted \(\mu_x\) in the GIF below).

So using the example values in the GIF

## our sample of values
x <- c(1,2,3,5,6,9)
## sample mean
sample_mean <- mean(x)
sample_mean
## [1] 4.333333
## distance from mean for each value
distance_from_mean <- x - sample_mean
distance_from_mean
## [1] -3.3333333 -2.3333333 -1.3333333  0.6666667  1.6666667  4.6666667
## squared distance from mean for each value
squared_distance_from_mean <- distance_from_mean^2
squared_distance_from_mean
## [1] 11.1111111  5.4444444  1.7777778  0.4444444  2.7777778 21.7777778
## sum of the squared distances
sum(squared_distance_from_mean)
## [1] 43.33333

Calculating SD and SEM

Now what about the SD? Remember it’s \(\sqrt{\frac{\Sigma_{i = 1}^n(x_i - \bar{x})^2}{n-1}}\), so SD = \(\sqrt{\frac{43.3333333}{6-1}}\) = \(\sqrt{\frac{43.3333333}{5}}\) = 2.9439203.

Or we could just use R’s sd() function

sd(x)
## [1] 2.94392

So the SEM is \(\frac{\text{SD}}{\sqrt{n}}\) = \(\frac{2.9439203}{\sqrt{6}}\)

In R

sd(x)/sqrt(length(x))
## [1] 1.20185

For the paua data we can simply use the in-built functions in R to calculate the SEM

sem <- paua %>% summarise(mean = mean(Length),
                   sem = sd(Length)/sqrt(length(Length)))
sem
## # A tibble: 1 × 2
##    mean   sem
##   <dbl> <dbl>
## 1  5.19 0.155

Visualising the uncertainty

Recall that the SEM is a measure of uncertainty about the mean. So we can use it to express our uncertainty visually. Typically \(\pm\) twice the SEM is the interval used:
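A sketch of how such an interval plot might be drawn (assuming the paua data, dplyr, and ggplot2 are loaded, as elsewhere in this chapter):

```r
## mean Length with error bars at +/- 2 SEM
paua_summary <- paua %>%
  summarise(mean = mean(Length),
            sem = sd(Length) / sqrt(length(Length)))
ggplot(paua_summary, aes(x = "All shells", y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - 2 * sem, ymax = mean + 2 * sem),
                width = 0.1) +
  ylab("Length (cm)") + xlab("") + theme_classic()
```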

Why error bars that are \(\pm\) twice the SEM?

This is approximately the 95% confidence interval for the population mean (see lecture)

The exact 95% CI is given by \(\bar{x}\) (mean) \(\pm\) \(t_{df,1 - \alpha/2}\) \(\times\) SEM

  • df = degrees of freedom (in this situation df = n - 1)
  • \(\alpha\) = level of significance

Each mean has its own confidence interval whose width depends on the SEM for that mean

When the df (more on these later) are large (e.g., 30 or greater) and \(\alpha = 0.05\), \(t_{df,1 - \alpha/2} = t_{\text{large},0.975} \approx 2\). Hence, the 95% confidence interval for the population mean is approximately \(\bar{x}\) (mean) \(\pm\) 2 \(\times\) SEM
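You can check the \(\approx 2\) claim directly with qt():

```r
## critical t-values for a 95% CI as the degrees of freedom grow
qt(0.975, df = c(5, 10, 30, 100, 1000))
## they approach the Normal critical value
qnorm(0.975)
```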

Back to our hypothesis test

Question: Do we believe that the average length of Pāua shells is 5cm?

Formalizing into a hypothesis test:

  • Null hypothesis: On average Pāua shells are 5cm long
  • Alternative hypothesis: On average Pāua shells are not 5cm long
  • Notationally: \(H_0: \mu = 5\) vs \(H_1: \mu \neq 5\) (\(\mu\) is the proposed mean)

Calculating a statistic (we use a t-statistic)

t-statistic \(= \frac{\bar{x}- \mu}{\text{SEM}}\) = \(\frac{5.1925 - 5}{0.155351}\) = 1.239

  • \(\bar{x}\) is the sample mean

  • \(\mu\) is the theoretical value (proposed mean)

The corresponding p-value

Recall that a p-value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value

So in this case it’s the probability, under the null hypothesis (\(\mu = 5\)), that we would observe a statistic at least as extreme as we did.

Under our null hypothesis the distribution of the t-statistic is as below. The statistic calculated from our hypothesis test was 1.2391. Now, remember that our alternative hypothesis was \(H_1: \mu \neq 5\) so we have to consider both sides of the inequality; hence, anything at least as extreme as our observed statistic (vertical lines) is either \(> 1.2391\) or \(< -1.2391\). Anything at least as extreme is therefore given by the grey shaded areas.

We can calculate the p-value using the pt() function (where q is our calculated t-statistic, and df are the degrees of freedom from above):

2*(1 - pt(q  = 1.2391,df = 59))
## [1] 0.2202152

Or we could do all of the above in one step using R

t.test(paua$Length, mu = 5 )
## 
##  One Sample t-test
## 
## data:  paua$Length
## t = 1.2391, df = 59, p-value = 0.2202
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.881643 5.503357
## sample estimates:
## mean of x 
##    5.1925

Recall that the p-value gives the probability, under our null hypothesis, that we observe anything at least as extreme as what we did (hence the \(\times 2\); think of the grey shaded area in the graph). This probability is \(\sim\) 22%. Do you think what we’ve observed is likely under the null hypothesis?

Does this plot help? The proposed mean is shown by the blue horizontal line; the dashed line shows the sample mean and the dotted lines \(\pm\) the SEM.

Differences between two means

Calculating the differences between species means:

Haliotis australis average - Haliotis iris average = \(\mu_{\text{Haliotis australis}} - \mu_{\text{Haliotis iris}}\) = 5.767 - 4.81 = 0.957. Doesn’t really tell us much…

As above the average values are all well and good, but what about variation? Recall the SEM from the one-sample t-test? The same idea holds here, although the calculation is a little bit more complicated (as we have to think about the number of observations in each group). But from the two group SEMs we can calculate the Standard Error of the Difference between two means, SED.
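For two independent samples the (Welch) SED is \(\sqrt{\text{SEM}_1^2 + \text{SEM}_2^2}\). A sketch of the calculation, assuming the paua data and dplyr are loaded:

```r
## SED for the difference between the two species means (Welch form,
## as used by t.test() with its default var.equal = FALSE)
sems <- paua %>%
  group_by(Species) %>%
  summarise(sem = sd(Length) / sqrt(n()))
sed <- sqrt(sum(sems$sem^2))
sed
```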

An independent samples t-test using t.test()

Question: Do we believe that on average the lengths of Pāua shells are equal between species?

Formalizing into a hypothesis test:

  • Null hypothesis: On average the species’ shells are the same length
  • Alternative hypothesis: they aren’t!
  • Notationally: \(H_0: \mu_{\text{Haliotis iris}} - \mu_{\text{Haliotis australis}} = 0\) vs \(H_1: \mu_{\text{Haliotis iris}} \neq \mu_{\text{Haliotis australis}},\) where \(\mu_{j}\) is the average length for species, \(j =\) (Haliotis iris, Haliotis australis).

Let us now calculate the test statistic: t-statistic = \(\frac{\bar{x}_{\text{difference}} - \mu}{\text{SED}}\) = \(\frac{\bar{x}_{\text{difference}} - 0}{\text{SED}}\) where \(\bar{x}_{\text{difference}}\) is the difference between the species’ averages.

Calculations are a little bit more tricky here so let’s use R:

test <- t.test(Length ~ Species, data = paua)
## printing out the result
test
## 
##  Welch Two Sample t-test
## 
## data:  Length by Species
## t = 3.5404, df = 57.955, p-value = 0.0007957
## alternative hypothesis: true difference in means between group Haliotis australis and group Haliotis iris is not equal to 0
## 95 percent confidence interval:
##  0.4158802 1.4980086
## sample estimates:
## mean in group Haliotis australis      mean in group Haliotis iris 
##                         5.766667                         4.809722
test$p.value
## [1] 0.0007956853

Listed are the t-statistic, t = 3.5404, and the p-value, p-value = \(7.96 \times 10^{-4}\), for the hypothesis test outlined above. What would you conclude?

TASK Are you a fan of the NFL or Taylor Swift, or maybe both? Download the following data and answer, if you can: how much better is the Chiefs star with his girlfriend in attendance?

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
data <- read_csv(paste(base_url, "swift_kelce.csv", sep = ""))

One-Way Analysis of Variance (ANOVA)

What if we had more than two groups?

In this section we consider the rats data, which contains calculated logAUC for 12 rats subjected to three different treatments (Surgery), C, P, and S.

glimpse(rats)
## Rows: 12
## Columns: 3
## $ Surgery <fct> C, C, C, C, P, P, P, P, S, S, S, S
## $ Rat     <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
## $ logAUC  <dbl> 8.49, 8.20, 9.08, 8.07, 10.24, 7.72, 9.34, 8.50, 11.31, 12.69,…
Between group SS (SSB)

The idea: Assess distances between treatment (surgical condition) means relative to our uncertainty about the actual (true) treatment means.

Add up the differences: -1.192 + -0.703 + 1.895 = 0. This is always the case, so this is not a great way to measure distances!

Sums of Squares?

\((-1.192)^2 + (-0.703)^2 + (1.895)^2\)

Add up the squared differences? But there are 4 observations in each group (treatment), so weight each squared difference by the group size:

\(4\times(-1.192)^2 + 4\times(-0.703)^2 + 4\times(1.895)^2\)

This is the Between Groups Sums of Squares or the Between group SS (SSB)

So the Between group SS (SSB) = 22.02635

Adding up the differences: -1.192 + -0.703 + 1.895 = 0. This is always the case and that itself gives us information…

We only need to know two of the values to work out the third!

So we have only 2 bits of unique information; SSB degrees of freedom = 2

Within group SS (SSW)

The Within group SS (SSW) arises from the same idea:

To assess distances between treatment (surgical condition) means relative to our uncertainty about the actual (true) treatment means.

Procedure:

  • Observation - Treatment mean
  • Square the difference
  • Add them up!

The Within group SS (SSW) measures the unexplained variance.

F-statistic

Recall the Between group SS (SSB) = 22.02635

So mean SSB = 22.02635 / 2

The within group SS (SSW) = 6.059075

Here we have \(3 \times 3 = 9\) bits of unique information: the within-group degrees of freedom is 9.

So mean SSW = 6.059/9

Consider the ratio \({\frac{{\text{variation due to treatments}}}{{\text{unexplained variance}}}} = {\frac{{\text{mean between-group variability}}}{{\text{mean within-group variability}}}} = \frac{\text{mean SSB}}{\text{mean SSW}} = \frac{\text{MSB}}{\text{MSW}} = \frac{\text{experimental variance}}{\text{error variance}} = \frac{22.02635/2}{6.059075/9} = 16.3586975.\)

This is the F-statistic!

Degrees of freedom (DF)

Essentially statistical currency (i.e., unique bits of information). So in the example above we have 3 treatment groups and if we know the mean of two we know the third (i.e., 2 unique bits of info) so SSB df = 2.

Now, for SSW df We have 12 observations (4 in each group); we know the treatment means so if we have three of those observed values in each group we know the fourth: 12 - 3 = 9 (i.e., number of observations - number of df lost due to knowing the cell means).
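The whole calculation can be done by hand in R, assuming the rats data are loaded as above:

```r
## between- and within-group sums of squares, by hand
grand_mean <- mean(rats$logAUC)
group_means <- tapply(rats$logAUC, rats$Surgery, mean)
n_per_group <- table(rats$Surgery)

## SSB: group size times squared distance of each group mean from the grand mean
ssb <- sum(n_per_group * (group_means - grand_mean)^2)
## SSW: squared distance of each observation from its own group mean
ssw <- sum((rats$logAUC - group_means[rats$Surgery])^2)

## F = (SSB / df_between) / (SSW / df_within)
f_stat <- (ssb / 2) / (ssw / 9)
c(SSB = ssb, SSW = ssw, F = f_stat)
```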

Inference

Hypothesis: We test the null hypothesis, \(H_0\), that the population (Surgery) means are all the same, versus the alternative hypothesis, \(H_1\), that at least one differs from the others!

Under our null hypothesis the distribution of the F-statistic is as below. The value calculated from our hypothesis test was 16.359. This observed test statistic is shown by the vertical line. To calculate the relevant p-value we can use

(1 - pf(q  = 16.351,df1 = 2, df2 = 9))
## [1] 0.001007825

Thus, the probability of getting an F-statistic at least as extreme as the one we observed (think of the area under the tail of the curve below) is the p-value, Pr(>F) = 0.001; we therefore have sufficient evidence to reject \(H_0\) at the 1% level of significance.

Alternatively, we could do this in one step using aov():

summary(aov(logAUC ~ Surgery, data = rats))
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Surgery      2 22.026  11.013   16.36 0.00101 **
## Residuals    9  6.059   0.673                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear regression

Model formula syntax in R

In R to specify the model you want to fit you typically create a model formula object; this is usually then passed as the first argument to the model fitting function (e.g., lm()).

Some notes on syntax:

Consider the model formula example y ~ x + z + x:z. There is a lot going on here:

  • The variable to the left of ~ specifies the response, everything to the right specify the explanatory variables
  • + indicates that both the variable to its left and the variable to its right should be included (it does not mean they should be summed)
  • : denotes the interaction of the variables to its left and right

Additionally, some other symbols have special meanings in model formulae:

  • * means to include all main effects and interactions, so a*b is the same as a + b + a:b

  • ^ is used to include main effects and interactions up to a specified level. For example, (a + b + c)^2 is equivalent to a + b + c + a:b + a:c + b:c (note (a + b + c)^3 would also add a:b:c)

  • - excludes terms that might otherwise be included. For example, -1 excludes the intercept otherwise included by default, and a*b - b would produce a + a:b

Mathematical functions can also be used directly in the model formula to transform a variable (e.g., y ~ exp(x) + log(z) + x:z). One thing that may seem counterintuitive is creating polynomial expressions (e.g., \(x^2\)). Here the expression y ~ x^2 does not mean squaring the explanatory variable x (this is due to the special meaning of ^ described above). To include \(x^2\) as a term in our model we have to use I() (the “as-is” operator): for example, y ~ I(x^2).
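A quick sketch of the difference, using simulated data for illustration only:

```r
set.seed(42)
dat <- data.frame(x = runif(50, 0, 10))
dat$y <- 2 + 0.5 * dat$x - 0.1 * dat$x^2 + rnorm(50)

## y ~ x^2 is interpreted as just y ~ x (one coefficient for x)
coef(lm(y ~ x^2, data = dat))
## I() forces the arithmetic squaring, giving a genuine quadratic term
coef(lm(y ~ x + I(x^2), data = dat))
```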

Traditional name Model formula R code
Simple regression \(Y \sim X_{continuous}\) lm(Y ~ X)
One-way ANOVA \(Y \sim X_{categorical}\) lm(Y ~ X)
Two-way ANOVA \(Y \sim X1_{categorical} + X2_{categorical}\) lm(Y ~ X1 + X2)
ANCOVA \(Y \sim X1_{continuous} + X2_{categorical}\) lm(Y ~ X1 + X2)
Multiple regression \(Y \sim X1_{continuous} + X2_{continuous}\) lm(Y ~ X1 + X2)
Factorial ANOVA \(Y \sim X1_{categorical} * X2_{categorical}\) lm(Y ~ X1 * X2) or lm(Y ~ X1 + X2 + X1:X2)

Some mathematical notation

Let’s consider a linear regression with a simple explanatory variable:

\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where

\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]

Here for observation \(i\)

  • \(Y_i\) is the value of the response
  • \(x_i\) is the value of the explanatory variable
  • \(\epsilon_i\) is the error term: the difference between \(Y_i\) and its expected value
  • \(\alpha\) is the intercept term (a parameter to be estimated), and
  • \(\beta_1\) is the slope: coefficient of the explanatory variable (a parameter to be estimated)

When we fit a linear model we make some key assumptions:

  • Independence
  • There is a linear relationship between the response and the explanatory variables
  • The residuals have constant variance
  • The residuals are normally distributed

In Module 4 you will learn that we have tools that enable us to relax some of these assumptions, after all these assumptions only very rarely hold for real-world data.
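R’s plot() method for lm objects gives the standard residual diagnostics used to check these assumptions; a sketch using the built-in cars data:

```r
## residuals vs fitted, Normal Q-Q, scale-location, residuals vs leverage
fit <- lm(dist ~ speed, data = cars)   # cars ships with base R
par(mfrow = c(2, 2))
plot(fit)
```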

In the sections below we use the penguins data from the palmerpenguins package to fit a number of different models using bill Depth (mm) of penguins as the response variable. We first remove the NA values for the response variable:

penguins <- filter(penguins, !is.na(penguins$bill_depth_mm))

A Null model

ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram() + theme_classic() +
  xlab("Bill depth (mm)")

Let’s fit a null (intercept only) model. This in old money would be called a one sample t-test.

slm_null <- lm(bill_depth_mm ~ 1, data = penguins)
summary(slm_null)$coef
##             Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 17.15117  0.1067846 160.6147 1.976263e-323

Model formula

This model, from above, is simply \[Y_i = \alpha + \epsilon_i.\]

Here for observation \(i\) \(Y_i\) is the value of the response (bill_depth_mm) and \(\alpha\) is a parameter to be estimated (typically called the intercept).

Inference

The (Intercept) term, 17.1511696, tells us the (estimated) average value of the response (bill_depth_mm), see

penguins %>% summarise(average_bill_depth = mean(bill_depth_mm))
## # A tibble: 1 × 1
##   average_bill_depth
##                <dbl>
## 1               17.2

The SEM (Std. Error) = 0.1067846.

The hypothesis being tested is \(H_0: \mu_{\text{bill depth}} = 0\) vs. \(H_1: \mu_{\text{bill depth}} \neq 0\), where \(\mu_{\text{bill depth}}\) is the population mean bill depth (the (Intercept) term in the output).

The t-statistic is given by t value = Estimate / Std. Error = 160.6146593

The p-value is given by Pr(>|t|) \(\approx 2 \times 10^{-323}\).

So the probability of observing a t-statistic at least as extreme as ours under the null hypothesis (average bill depth = 0) is \(\approx 2 \times 10^{-323}\); pretty strong evidence against the null hypothesis I’d say!

Single continuous variable

Does bill_length_mm help explain some of the variation in bill_depth_mm?

p1 <- ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() + ylab("Bill depth (mm)") +
  xlab("Bill length (mm)") + theme_classic()
p1

slm <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)

Model formula

This model is simply \[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where for observation \(i\) \(Y_i\) is the value of the response (bill_depth_mm) and \(x_i\) is the value of the explanatory variable (bill_length_mm); As above \(\alpha\) and \(\beta_1\) are parameters to be estimated. We could also write this model as

\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \epsilon \end{aligned} \]

Fitted model

As before we extract the estimated parameters (here \(\alpha\) and \(\beta_1\)) using

summary(slm)$coef
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    20.88546832 0.84388321 24.749240 4.715137e-78
## bill_length_mm -0.08502128 0.01906694 -4.459093 1.119662e-05

Here, the (Intercept): Estimate (\(\alpha\) above) gives us the estimated average bill depth (mm) when bill length is zero; that is, the height of the fitted line given the estimated relationship between bill depth and bill length.

The bill_length_mm : Estimate (\(\beta_1\) above) is the slope associated with bill length (mm). So, here for every 1mm increase in bill length we estimated a 0.085mm decrease (or a -0.085mm increase) in bill depth.

## calculate predicted values
penguins$pred_vals <- predict(slm)
## plot
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() + ylab("Bill depth (mm)") +
  xlab("Bill length (mm)") + theme_classic() +
  geom_line(aes(y = pred_vals))

A single factor variable

Does bill_depth_mm vary between species? Think of ANOVA.

p1 <- ggplot(data = penguins, aes(x = species, y = bill_depth_mm)) +
  geom_violin() + ylab("Bill depth (mm)") +
  xlab("Species") + theme_classic() + stat_summary(fun = mean)
p1

sflm <- lm(bill_depth_mm ~ species, data = penguins)

Model formula

This model is simply \[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where for observation \(i\), \(Y_i\) is the value of the response (bill_depth_mm) and \(x_i\) is the value of the explanatory variable (species). However, species is a factor variable. This means we need to use dummy variables:

\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \epsilon \end{aligned} \]

Fitted model

The estimated parameters (here \(\alpha\), \(\beta_1\), and \(\beta_2\)) are

summary(sflm)$coef
##                     Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)      18.34635762 0.09121134 201.1411914 0.000000e+00
## speciesChinstrap  0.07423062 0.16368785   0.4534889 6.504869e-01
## speciesGentoo    -3.36424379 0.13613555 -24.7124554 7.929020e-78

Here, the (Intercept): Estimate (\(\alpha\) above) gives us the estimated average bill depth (mm) of the baseline species Adelie (R does this alphabetically, to change this see previous chapter) and the remaining coefficients are the estimated difference between the baseline and the other categories (species). The Std. Error column gives the standard error for these estimated parameters (in the case of the group differences there are the SEDs, standard error of the difference).

## calculate predicted values
penguins$pred_vals <- predict(sflm)
## plot
ggplot(data = penguins, aes(x = species, y = bill_depth_mm, fill = species)) +
  geom_violin() + ylab("Bill depth (mm)") +
  xlab("Species") + theme_classic() +
  geom_hline(aes(yintercept = pred_vals, col = species), linewidth = 2) +
    stat_summary(fun = mean)

One factor and a continuous variable

p2 <- ggplot(data = penguins,
             aes(y = bill_depth_mm, x = bill_length_mm, color = species)) +
  geom_point() + ylab("Bill depth (mm)") +
  xlab("Bill length (mm)") + theme_classic()
p2

slm_sp <- lm(bill_depth_mm ~ bill_length_mm + species, data = penguins)

Model formula

Now we have two explanatory variables, so our model formula becomes

\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\] \[\epsilon_i \sim \text{Normal}(0,\sigma^2)\]

where for observation \(i\)

  • \(Y_i\) is the value of the response (bill_depth_mm)
  • \(z_i\) is one explanatory variable (bill_length_mm say)
  • \(x_i\) is another explanatory variable (species say)
  • \(\epsilon_i\) is the error term: the difference between \(Y_i\) and its expected value
  • \(\beta_0\), \(\beta_1\), and \(\beta_2\) are all parameters to be estimated.

Here, again, the Adelie group are the baseline.

So model formula is

\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \epsilon \end{aligned} \]

Fitted model

summary(slm_sp)$coef
##                    Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      10.5921805 0.68302332  15.50779 2.425034e-41
## bill_length_mm    0.1998943 0.01749365  11.42668 8.661124e-26
## speciesChinstrap -1.9331943 0.22416005  -8.62417 2.545386e-16
## speciesGentoo    -5.1060201 0.19142432 -26.67383 3.650225e-85

Simpson’s paradox… look how the slope associated with bill length (coefficient of bill_length_mm) has switched direction from the model above! Why do you think this is?

Here, the (Intercept): Estimate gives us the estimated average bill depth (mm) of the Adelie penguins given the estimated relationship between bill length and bill depth. Technically, this is the estimated bill depth (mm) for Adelie penguins with zero bill length; taken literally that is clearly nonsense, as a zero bill length is impossible in practice! I would recommend thinking of this as the y-shift (i.e., height) of the fitted line.

The bill_length_mm : Estimate (\(\beta_1\) above) is the slope associated with bill length (mm). So, here for every 1mm increase in bill length we estimated a 0.2mm increase in bill depth.

What about the coefficient of the other species levels? Look at the plot below, these values give the shift (up or down) of the parallel lines from the Adelie level. So given the estimated relationship between bill depth and bill length these coefficients are the estimated change from the baseline.

## calculate predicted values
penguins$pred_vals <- predict(slm_sp)
## plot
ggplot(data = penguins, aes(y = bill_depth_mm, x = bill_length_mm, color = species)) +
  geom_point() + ylab("Bill depth (mm)") +
  xlab("Bill length (mm)") + theme_classic()  +
  geom_line(aes(y = pred_vals))

Interactions

Recall the (additive) model formula from above

\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\]

but what about interactions between variables? For example,

\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_ix_i + \epsilon_i\]

Note: we include interaction effects in our model by using either the * or : syntax in our model formula. For example,

  • : denotes the interaction of the variables to its left and right, and

  • * means to include all main effects and interactions, so a*b is the same as a + b + a:b.

To specify a model with additive and interaction effects we use

slm_int <- lm(bill_depth_mm ~ bill_length_mm*species, data = penguins)

Model formula

The model formula is then

\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{5}(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Gentoo}}) + \epsilon \end{aligned} \]

Fitted model

summary(slm_int)$coef
##                                    Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)                     11.40912448 1.13812137 10.0245235 7.283046e-21
## bill_length_mm                   0.17883435 0.02927108  6.1095922 2.764082e-09
## speciesChinstrap                -3.83998436 2.05398234 -1.8695313 6.241852e-02
## speciesGentoo                   -6.15811608 1.75450538 -3.5098873 5.094091e-04
## bill_length_mm:speciesChinstrap  0.04337737 0.04557524  0.9517750 3.418952e-01
## bill_length_mm:speciesGentoo     0.02600999 0.04054115  0.6415701 5.215897e-01

This can be written out:

\[ \begin{aligned} \operatorname{\widehat{bill\_depth\_mm}} &= 11.41 + 0.18(\operatorname{bill\_length\_mm}) - 3.84(\operatorname{species}_{\operatorname{Chinstrap}}) - 6.16(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad 0.04(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Chinstrap}}) + 0.03(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Gentoo}}) \end{aligned} \]

The interaction terms (i.e., bill_length_mm:speciesChinstrap and bill_length_mm:speciesGentoo) specify the species-specific slopes given the other variables in the model. We can use our dummy variable trick again to interpret the coefficients correctly. In this instance we have one factor explanatory variable, species, and one continuous explanatory variable, bill length (mm). As before, Adelie is our baseline reference.

Let’s assume we’re talking about the Adelie penguins, then our equation becomes (using the dummy variable technique)

\[\widehat{\text{bill_depth_mm}} = 11.409 + (0.179 \times \text{bill_length_mm}).\]

So, the (Intercept): Estimate term (\(\hat{\alpha}\)), again, specifies the height of the Adelie fitted line and the main effect of bill_length_mm: Estimate (\(\hat{\beta_1}\)) estimates the relationship (slope) between bill length (mm) and bill depth (mm) for the Adelie penguin. So, here for every 1mm increase in bill length (mm) for the Adelie penguins we estimate, on average, a 0.179mm increase in bill depth (mm).

Now, what about Gentoo penguins? Our equation then becomes

\[\widehat{\text{bill_depth_mm}} = 11.409 + (0.179 \times \text{bill_length_mm}) + (-6.158) + (0.026 \times \text{bill_length_mm}).\]

which simplifies to

\[\widehat{\text{bill_depth_mm}} = 5.251 + (0.205 \times \text{bill_length_mm}).\]

The estimated Gentoo-specific intercept term (y-axis line height) is therefore \(\hat{\alpha} + \hat{\beta_3} = 11.409 + (-6.158) = 5.251.\) The Gentoo-specific bill length (mm) slope is then \(\hat{\beta_1} + \hat{\beta_5} = 0.179 + 0.026 = 0.205.\) So, for every 1mm increase in bill length (mm) for the Gentoo penguins we estimate, on average, a 0.205mm increase in bill depth (mm), a slightly steeper slope than for the estimated Adelie relationship (i.e., we estimate that as Gentoo bills get longer their depth increases at a, slightly, greater rate than those of Adelie penguins).
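The Gentoo-specific intercept and slope can be reproduced directly from the coefficient table printed above (values rounded to 3 dp here):

```r
## reproduce the Gentoo-specific intercept and slope from the
## coefficient estimates printed above (rounded to 3 dp)
alpha_hat <- 11.409  ## (Intercept)
beta1_hat <- 0.179   ## bill_length_mm
beta3_hat <- -6.158  ## speciesGentoo
beta5_hat <- 0.026   ## bill_length_mm:speciesGentoo

(gentoo_intercept <- alpha_hat + beta3_hat)  ## 5.251
(gentoo_slope <- beta1_hat + beta5_hat)      ## 0.205
```

In practice you would extract the unrounded estimates with, e.g., coef(slm_int)[["speciesGentoo"]] rather than retyping rounded values.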

In summary, the main effect of species (i.e., speciesChinstrap: Estimate and speciesGentoo:Estimate ) again give the shift (up or down) of the lines from the Adelie level. However, these lines are no longer parallel (i.e., each species of penguin has a different estimated relationship between bill length and bill depth)!

But, now we’ve specified this all singing and dancing interaction model we might ask: are the non-parallel lines non-parallel enough to reject the parallel line model? Look at the plot below; what do you think?

Other possible relationships

Let’s assume that we have the same data types as above, specifically, a continuous response (\(y\)) and one factor (\(f\), with two levels \(A\) and \(B\)) and one continuous (\(x\)) explanatory variable. Assuming that \(A\) is the baseline for \(f\) the possible models are depicted below.

Note that models III and V are forced to have the same intercept for both levels of \(f\). In addition, when you have no main effect of \(x\), models IV and V, then the model is forced to have no effect of \(x\) for the baseline level of \(f\) (in this case \(A\)).
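As a sketch, these relationship types can be specified in R by coding the dummy variable for level \(B\) explicitly; the data below are simulated purely for illustration, and the mapping of formulas to the figure's numbering is my reading of the constraints described above:

```r
set.seed(99)
## hypothetical data: one continuous (x) and one two-level factor (f)
dat <- data.frame(x = runif(40), f = factor(rep(c("A", "B"), each = 20)))
dat$y <- 1 + 2 * dat$x + rnorm(40)
dat$fB <- as.numeric(dat$f == "B")  ## dummy variable: 1 if f == "B", else 0

m1 <- lm(y ~ x + fB, data = dat)    ## parallel lines (common slope)
m2 <- lm(y ~ x * fB, data = dat)    ## separate intercepts and slopes
m3 <- lm(y ~ x + x:fB, data = dat)  ## common intercept, separate slopes
m4 <- lm(y ~ fB + x:fB, data = dat) ## separate intercepts, no x effect at baseline A
m5 <- lm(y ~ x:fB, data = dat)      ## common intercept, no x effect at baseline A
```

Counting the coefficients confirms the constraints: m4 and m5 estimate no slope for the baseline level \(A\), and m3 and m5 share a single intercept across both levels.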

Model diagnostics for a linear model

It is always imperative that we check the underlying assumptions of our model! If our assumptions are not met then, basically, the maths falls over and we can’t reliably draw inference from the model (e.g., we can’t trust the parameter estimates). Two of the most important assumptions are:

  • equal variances (homogeneity of variance), and
  • normality of residuals.

Let’s look at the fit of the slm model (single continuous explanatory variable), the gglm function from the gglm package produces a nice panel plot of our model residuals:

gglm(slm_sp)

Residuals vs Fitted plot

You are basically looking for no pattern or structure in your residuals (e.g., a “starry” night). What you definitely don’t want to see is the scatter increasing around the zero line (dashed line) as the fitted values get bigger (e.g., think of a trumpet, a wedge of cheese, or even a slice of pizza), which would indicate unequal variances (heteroscedasticity).

Normal quantile-quantile (QQ) plot

This plot shows the sorted residuals versus the expected order statistics from a standard normal distribution. Points should lie close to the line; points moving away from the 45 degree line at the tails suggest the residuals come from a skewed distribution.

Scale-Location plot (\(\sqrt{\text{|standardized residuals vs Fitted|}}\))

Another way to check the homoskedasticity (constant variance) assumption. We want the line to be roughly horizontal: if it is, then the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values. We’d also like the spread around the line not to vary much with the fitted values; then the variability of the magnitudes doesn’t vary much as a function of the fitted values either.

Residuals vs Leverage plot (standardized residuals vs Leverage)

This can help detect outliers in a linear regression model. For a linear regression model, leverage measures how sensitive a fitted value is to a change in the true response. We’re looking at how the spread of standardized residuals changes as the leverage increases. This plot can also be used to detect heteroskedasticity and non-linearity: the spread of standardized residuals shouldn’t change as a function of leverage. In addition, points with high leverage may be influential: that is, deleting them would change the model a lot.

Do you think the residuals are Normally distributed (look at the QQ plot)? Think of what this model is, do you think it’s the best we can do?
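If you prefer base graphics, the same four diagnostic panels that gglm draws can be produced with R’s built-in plot() method for "lm" objects:

```r
## base R equivalent of the gglm panel plot
par(mfrow = c(2, 2))  ## 2 x 2 panel layout
plot(slm_sp)          ## Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  ## reset the layout
```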

Model selection

Are the non-parallel lines non-parallel enough to reject the parallel line model?

Now we can compare nested linear models by hypothesis testing. Luckily the R function anova() automates this. Yes, the same idea as we’ve previously learnt about: ANOVA! We essentially perform an F-ratio test between the nested models!

By nested we mean that one model is a subset of the other (i.e., where some coefficients have been fixed at zero). For example,

\[Y_i = \beta_0 + \beta_1z_i + \epsilon_i\]

is a nested version of

\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\] where \(\beta_2\) has been fixed to zero.

As an example consider testing the single explanatory variable model slm against the same model with species included as a variable slm_sp. To carry out the appropriate hypothesis test in R we can run

anova(slm,slm_sp)
## Analysis of Variance Table
## 
## Model 1: bill_depth_mm ~ bill_length_mm
## Model 2: bill_depth_mm ~ bill_length_mm + species
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    340 1256.4                                  
## 2    338  307.2  2    949.16 522.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you’ll see, the anova() function takes the two model objects (slm and slm_sp) as arguments. It returns an ANOVA table testing whether the more complex model (slm_sp) is just as good at capturing the variation in the data as the simpler model (slm). The returned p-value should be interpreted as in any other hypothesis test, i.e., the probability of observing a statistic at least as extreme under our null hypothesis (here, that each model is as good as the other at capturing the variation in the data).

What would we conclude here? I’d say we have pretty strong evidence against the models being equally good! I’d definitely plump for slm_sp over slm, looking back at the plots above does this make sense?
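We can also reconstruct the F statistic by hand from the quantities printed in the ANOVA table above (the residual sums of squares and residual degrees of freedom of each model):

```r
## F = [(RSS_simple - RSS_complex) / extra df] / [RSS_complex / resdf_complex]
rss_slm <- 1256.4; rss_slm_sp <- 307.2  ## residual sums of squares from the table
df_extra <- 340 - 338                   ## two extra species coefficients
f_stat <- ((rss_slm - rss_slm_sp) / df_extra) / (rss_slm_sp / 338)
round(f_stat, 1)  ## ~522, matching the anova() output up to rounding
## the p-value is the upper tail of the F(2, 338) distribution
pf(f_stat, df_extra, 338, lower.tail = FALSE)
```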

Now what about slm_int vs slm_sp?

anova(slm_sp,slm_int)
## Analysis of Variance Table
## 
## Model 1: bill_depth_mm ~ bill_length_mm + species
## Model 2: bill_depth_mm ~ bill_length_mm * species
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    338 307.20                           
## 2    336 306.32  2   0.87243 0.4785 0.6202

So it seems both models are just as good at capturing the variation in our data: we’re happy with the parallel lines!

Another way we might compare models is by using the Akaike information criterion (AIC) (you’ll see more of this later in the course). AIC is an estimator of out-of-sample prediction error and can be used as a metric to choose between competing models: we’re looking for the smallest AIC (i.e., the smallest estimated out-of-sample prediction error). Typically, a difference of 4 or more is considered to indicate an improvement; this should not be taken as writ, however, and using multiple model comparison techniques is advised.

R already has an AIC() function that can be used directly on your lm() model object(s). For example,

AIC(slm,slm_sp,slm_int)
##         df       AIC
## slm      3 1421.5521
## slm_sp   5  943.8504
## slm_int  7  946.8778

This backs up what our ANOVA suggested: model slm_sp is preferred! As always it’s important to do a sanity check! Does this make sense? Have a look at the outputs from these models and see what you think.
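A common way to read an AIC table is via differences from the smallest value; using the AIC values printed above:

```r
## AIC differences relative to the best (smallest) AIC
aics <- c(slm = 1421.5521, slm_sp = 943.8504, slm_int = 946.8778)
round(aics - min(aics), 2)  ## slm: 477.70, slm_sp: 0.00, slm_int: 3.03
```

slm_int sits only about 3 AIC units above slm_sp, below the rule-of-thumb difference of 4, so the simpler slm_sp is preferred, agreeing with the F-ratio test.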

TASK

Just because we’ve chosen a model (the best of a bad bunch perhaps) this doesn’t let us off the hook. We should check our assumptions. Use the plot below and briefly comment on whether you think the model assumptions are met.

gglm(slm_sp) 

Inference for a linear model

After all that what do estimated parameters mean? Assuming our model assumptions are met we can draw inference based on the estimated parameter values.

Point prediction

Using the slm_sp model we can make a point prediction for the expected bill depth (mm) for Gentoo penguins with a bill length of 50mm.

Recall the model equation is

\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \epsilon \end{aligned} \]

The fitted equation is given as

\[ \begin{aligned} \operatorname{\widehat{bill\_depth\_mm}} &= 10.59 + 0.2(\operatorname{bill\_length\_mm})\ - \\ &\quad 1.93(\operatorname{species}_{\operatorname{Chinstrap}}) - 5.11(\operatorname{species}_{\operatorname{Gentoo}}) \end{aligned} \]

We can then simply substitute in the values:

\[\widehat{\text{bill depth}} = \hat{\alpha} + \hat{\beta_1}*50 + \hat{\beta_3}*1\] \[\downarrow\]

\[\widehat{\text{bill depth}} = 10.59 + 0.20*50 - 5.11*1\] \[\downarrow\]

\[\approx 15.48\text{mm}\]

Rather than by hand we can do this easily in R

## create new data frame with data we want to predict to
## the names have to match those in our original data frame
newdata <- data.frame(species = "Gentoo",bill_length_mm = 50)
## use predict() function
predict(slm_sp, newdata = newdata) ## more accurate than our by hand version!
##        1 
## 15.48087
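predict() can also return interval estimates via its interval argument; for example, for the expected bill depth at these covariate values:

```r
## 95% confidence interval for the *expected* bill depth at these covariate values
predict(slm_sp, newdata = newdata, interval = "confidence")
## a prediction interval for a *new* observation is wider
predict(slm_sp, newdata = newdata, interval = "prediction")
```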

What does this look like on a plot?

Confidence intervals for parameters

For the chosen slm_sp model we can get these simply by using confint(). By default the 95% intervals are returned:

cis <- confint(slm_sp)
cis
##                       2.5 %     97.5 %
## (Intercept)       9.2486686 11.9356923
## bill_length_mm    0.1654842  0.2343044
## speciesChinstrap -2.3741187 -1.4922698
## speciesGentoo    -5.4825531 -4.7294870

So this tells us that for every 1mm increase in bill length we estimate the expected bill depth to increase by between 0.165 and 0.234 mm.

We estimate that the expected bill depth of a Chinstrap penguin is between 1.5 and 2.4 mm shallower than that of an Adelie penguin.
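The confidence level can be changed via the level argument of confint(); for example, 90% intervals:

```r
## 90% confidence intervals for the slm_sp coefficients
confint(slm_sp, level = 0.9)
```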

Marginal predictions

One other thing we may want to do is give marginal predictions, which show the marginal effect of a predictor, holding other variables constant.

The ggpredict() function from the ggeffects package can be used for this:

ggpredict(slm_sp, terms = "species")
## # Predicted values of bill_depth_mm
## 
## species   | Predicted |       95% CI
## ------------------------------------
## Adelie    |     19.37 | 19.14, 19.61
## Chinstrap |     17.44 | 17.16, 17.72
## Gentoo    |     14.27 | 14.06, 14.48
## 
## Adjusted for:
## * bill_length_mm = 43.92

We can even plot the marginal effects:

ggpredict(slm_sp, terms = "species") %>% 
  ggplot(aes(x = x, y = predicted, label = round(predicted, 3))) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .5) +
  geom_text(hjust = -.2) +
  ylab( "Predicted bill depth (mm)") + xlab("") +
  ggtitle("Fitted means for each species with 95% CIs", subtitle = "Adjusted for bill_length_mm = 43.92") +
  theme_classic()

Module 3

We now know how to deal with data in R; however, before we start drawing conclusions we need to know how the data were collected and, perhaps more importantly, how to collect data ourselves. Generally, data are either observational (data collected where researchers don’t control the environment, but simply observe outcomes) or experimental (data collected where researchers introduce some intervention and control the environment in order to draw inference).

TASK R. A. Fisher’s work in the area of experimental design is, perhaps, the most well known in the field. The principles he devised we still abide by today. Note, however, to give a balanced view of the celebrated mathematician: many of his views (on eugenics and race in particular) are widely condemned. I would urge you to read up on his work in this area and come to your own conclusions.

R packages and datasets

It is assumed in all following examples of this module that the following code has been executed successfully.

packages used in this module

library(tidyverse) ## general functionality
library(edibble) ## experiment setup
library(gglm) ## nice lm() diagnostic plots
library(emmeans) ## pairwise contrasts
library(lme4) ## mixed effects models
library(lmerTest) ## mixed effects models
library(predictmeans) ## pairwise comparisons

datasets used in this module

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
  1. rats
rats <- read_csv(paste(base_url, "crd_rats_data.csv", sep = ""))
rats$Surgery <- as_factor(rats$Surgery)
  2. factorial
factorial <- read_csv(paste(base_url, "factorial_expt.csv", sep = ""))
  3. rcbd
rcbd <- read_csv(paste(base_url, "rcbd.csv", sep = ""))
  4. split_plot
split_plot <- read_csv(paste(base_url, "split_plot.csv", sep = ""))
  5. liver
liver <- read_csv(paste(base_url, "repeated_measures_liver.csv", sep = ""))

Introduction to the design and analysis of experiments

“A useful property of a test of significance is that it exerts a sobering influence on the type of experimenter who jumps to conclusions on scanty data, and who might otherwise try to make everyone excited about some sensational treatment effect that can well be ascribed to the ordinary variation in [their] experiment.”
— Gertrude Mary Cox

Key phrases

  • An experiment is a procedure (or set of actions) where a researcher intentionally changes some factor/treatment/variable to observe the effect of their actions. As mentioned above, the collection of observational data is not experimentation.

  • An experimental unit is the smallest portion of experimental material which is independently perturbed. This is the item under study for which some variable (treatment) is changed. For example this could be a human subject or an agricultural plot.

  • An observational unit (or subsample) is the smallest unit on which a response is measured. If the experimental unit is split after the treatment has been applied (e.g., multiple samples taken from one human subject) then this sample is called a subsample or observational unit. If one measurement is made on each experimental unit then the observational unit = the experimental unit. If multiple measurements are made on each subject (e.g., human) then each experimental unit has >1 observational unit. This is then pseudo- or technical replication (see below).

  • A treatment (or independent variable or factor or treatment factor) is an experimental condition independently applied to an experimental unit. It is one of the variables that is controlled by the researcher during the experiment (e.g., drug type). The values of the treatments within a set are called levels.

  • The dependent variable or response is the output (or thing) that is measured after an experiment. This is what the researcher measures and assesses if changing the treatment(s) (i.e., independent variable(s)) induces any change.

  • An effect is the change in the response variable caused by the controlled changes in the independent variable. Whether the magnitude of the effect (its size) is significant (or of any practical interest) is determined by the researcher after carrying out some appropriate analyses.

Setting up an experiment

In the previous chapter we discussed [Reproducibility]. This also applies to the design and analysis of experiments. In order to future-proof our research we should make every effort to ensure others can reproduce it. To do so we should be specific about our goals and procedures by

  1. Defining the goals/objectives of our research,
  2. Formulating a specific hypothesis,
  3. Specifying the response variable(s) that will be measured,
  4. Specifying the treatments that will be tested and describing the process of applying these treatments to the experimental units,
  5. Outlining the procedure for observing and recording responses to assess treatment effects,
  6. Identifying factors that may contribute to variability in the results (expected and otherwise), and
  7. Describing the statistical methods that will be employed to test our hypothesis.

Following these guidelines not only helps our experiment, but makes sure others can reproduce it. For example, defining specific objectives directs you towards writing focused statements about the investigative questions you want your experiment to answer. Listing the experimental factors (or treatments or independent variables) you will study in your experiment helps to organize variables and work out how they may help to explain observed changes in your measurable response(s). It is important that the experimental factor can be controlled during and between experimental runs. Variables that are thought to affect the response, but cannot be controlled, cannot be included as an experimental factor.

An example

Experiment Title Barista Brews
Researcher Name Dr Java
Researcher Institute Central Perk
Objective The objective of this experiment is to determine what type of grind and brew ratio (i.e., amount of coffee in relation to water) leads to the strongest coffee.
Hypothesis That a finer grained coffee with a higher brew ratio (i.e., less water to coffee) will lead to the strongest coffee.
Response variable Total Dissolved Solids (TDS), a measure of soluble concentration (%).
Treatments A. Brew ratio i) 2 parts coffee to 1 part water (2-1) ii) 1 part coffee to 1 part water (1-1) iii) 1 part coffee to 2 parts water (1-2) B. Grind i) Fine ii) Coarse
Experimental material Individual cups, maintaining consistency in cup size, water amount, and boiling temperature.
How treatments will be applied Independently to individual cups. Each cup will be made independently, not as a batch distributed among several cups.
How measurements will be taken Treatments applied independently to individual cups. Each cup with the same treatment (brew ratio and grind) will be subject to the same environmental conditions.
Experimental units Replicate each treatment twice. Assign 2 experimental units (cups) to each unique treatment combination.
TASK Design your own experiment briefly outlining each of the steps listed above. You may find this application useful.

The three (main) principles of experimental design

  1. Replication

Repeating an experiment means that you can assess the consistency and reliability of any observed effects. If you observe the effect repeatedly it is less likely that it occurred simply due to random variation. There are many ways replication can be included in an experiment. What you repeat (e.g., treatments or measurements) depends on what effect or source of variation you wish to investigate. The table below summarizes, perhaps, the three most commonly employed types of replication.

Replication Type Description Why
Biological Replication Each treatment is independently applied to several humans, animals, or plants. To generalize results to the population.
Technical Replication Two or more samples from the same biological source independently processed. Advantageous if processing steps introduce a lot of variation; increases precision in comparing relative abundances between treatments.
Pseudo-replication One sample from the same biological source divided into two or more aliquots independently measured. Advantageous for noisy measuring instruments; increases precision in comparing relative abundances between treatments.
  1. Randomization

Employing randomization in your experiment goes towards ensuring the validity, reliability, and generalizability of your results. The main reason to randomize allocation of treatments to experimental units is to protect against bias. We typically wish to plan the experiment in such a way that the variations caused by extraneous factors can all be combined under the general heading of “chance”. Doing so ensures that each treatment has the same probability of getting good (or bad) units and thus avoids systematic bias. Random allocation can cancel out population bias; it ensures that any other possible causes for the experimental results are split equally between groups. Hence, by creating comparable treatment groups through random assignment we can minimize bias and increase the validity and generalizability of the results.

Typically, statistical analysis assumes that observations are independent. This is almost never strictly true in practice, but randomization means that our estimates will behave as if they were based on independent observations. Randomization is also useful in situations where we may be unaware of all potential variables, as it is more likely to distribute the effects of unknown factors evenly. In addition, randomizing treatment allocation can aid in the ethical conduct of our research as it promotes fair distribution of benefits and risks among participants.

  1. Blocking

Blocking helps control variability by making treatment groups more alike. Experimental units are divided into subsets (called blocks) so that units within the same block are more similar (homogeneous) than units from different subsets or blocks. The experiment is then conducted separately within each block. Blocking is a technique for dealing with nuisance factors, a factor that has some effect on the response, but is of no interest. By grouping similar units together in blocks we reduce the effect of the nuisance factors, which can improve the precision of our estimates (magnitude of effects). In addition, blocking helps prevent confounding by controlling the impact of known or suspected sources of variation.

Let’s briefly delve into the realms of fantasy6. The fantastical example we consider is set at Basgiath College where students are trained to serve the Empire. Students are either trained as Riders or Scribes entering the respective Quadrant. Along comes a new Professor, Professor Llewelyn, who thinks that their new dragon-riding coaching technique is guaranteed to result in huge improvements in a student’s dragon-riding ability. They conduct an experiment comparing the new and old coaching techniques.

To conduct their experiment they use a mix of students from the two different Quadrants: Riders (who have been riding dragons all their lives) and Scribes (who have never ridden a dragon before). Before any coaching is given all students are asked to complete a dragon assault course whilst atop a dragon, which is timed. Then, over a few months, each student randomly receives one of the coaching techniques (i.e., new or old). After this, the students are asked to complete another assault course of similar difficulty and their time is again recorded. Their improvement (if any) in dragon riding is measured by the difference between the two times.

However, the huge variation in baseline riding ability between these two groups (i.e., Riders are obviously going to be much better at riding dragons than Scribes) will obfuscate/obscure any improvement the new teaching technique may induce. Therefore, to find out if the new dragon-riding coaching technique works Professor Llewelyn blocks by Quadrant setting up the experiment as follows.

Title
Objective of experiment Examine the impact of a new dragon riding coaching technique
Primary Variable of Interest Improvement in dragon riding (difference between assault course completion times)
Nuisance Factors Introducing Variability Quadrant (i.e., Riders or Scribes)
Blocking We use blocking based on Quadrants to control for variability introduced by the differences among Riders and Scribes
Experimental Units Students are the experimental units
Blocking Procedure 1. Identify Quadrant of student (i.e., Rider or Scribe). 2. Create Blocks: Group students based on Quadrant. 3. Random Assignment within Blocks: Randomly assign students to different coaching techniques (i.e., old and new). This ensures each technique is tested within each Quadrant.
Experimental Setup Each block (representing a Quadrant) contains a random assignment of students receiving different dragon riding coaching.
Benefits of Blocking By using blocking based on Quadrant, the experiment controls for potential variation in dragon riding improvement caused by the baseline differences in a student’s dragon riding ability due to their Quadrant. This enhances the precision of the experiment and allows for a more accurate assessment of the impact of the coaching on dragon riding skill.
Example Outcome After the experiment, researchers analyze dragon riding improvement separately for each Quadrant, drawing conclusions about the technique’s effectiveness within specific Quadrants while minimizing the influence of Quadrant on the overall results.
TASK Identify the observational units, treatment levels, response variable, and the number of blocks in the experiment above.

A simple example: Charlotte’s coffee

Last Christmas I was given a gift set of three types of coffee beans. I want to know which makes the darkest coffee; to do this I measure the opacity after the coffee is made.

  • Three types of coffee beans are: Arabica , Liberica , and Robusta .
  • 12 identical cups are chosen and sets of four cups are randomly allocated one of three treatments.
  • Four sets of each type of coffee beans are ground and cups made (in the same way) resulting in a total of 12 cups of coffee, see below.
  • Samples are taken from each cup and the coffee colour is measured.

Experiment design:

Scientific question: Does coffee colour differ between types of coffee beans?

There are 3 treatments (types of coffee beans): Arabica, Liberica, and Robusta.

In this case the experimental unit would be the coffee cup as each one is allocated a different bean type (treatment).

The observational unit changes depending on the scenario (i.e., what and how samples are taken). For example,

  • if a single ml of liquid is taken from each cup and one measurement is taken per ml \(\rightarrow\) the observational unit would be the cup;
  • if a single ml of liquid is taken from each cup, two subsamples are then taken from each ml, and measurements are taken per subsample \(\rightarrow\) the observational unit would be a subsample;
  • if four \(\times\) 1 ml of liquid were taken from each cup and a measurement is taken from each ml \(\rightarrow\) the observational unit would be each 1ml sample.

Some basic experimental designs

Completely randomised design (CRD)

Let’s consider a completely randomized design with one treatment factor (e.g., coffee bean type). Here, \(n\) experimental units (e.g., cups) are divided randomly into \(t\) groups. Random allocation can be achieved by simply drawing lots from a hat! To be more rigorous, though, we could use R’s sample() function (have a go yourself and see if you can work out how to wield sample()). Each group is then given one treatment level (one of the levels of the treatment factor). As we have defined only one treatment factor, all other known independent variables are kept constant so as to not bias any effects.

An illustration of a CRD with one treatment factor and three treatment levels (A, B, & C)

Designing a CRD using R

Here we’re going to use R to do the random allocation of treatments for us.

## create a character vector of bean types
beans <- rep(c("Arabica","Liberica", "Robusta"), each = 4)
beans
##  [1] "Arabica"  "Arabica"  "Arabica"  "Arabica"  "Liberica" "Liberica"
##  [7] "Liberica" "Liberica" "Robusta"  "Robusta"  "Robusta"  "Robusta"
## randomly sample the character vector to give the order of coffees
set.seed(1234) ## this is ONLY for consistency, remove if doing this yourself
allocation <- sample(beans, 12)
allocation
##  [1] "Robusta"  "Robusta"  "Liberica" "Liberica" "Arabica"  "Liberica"
##  [7] "Arabica"  "Robusta"  "Arabica"  "Liberica" "Robusta"  "Arabica"

Having run the code above your CRD plan is as follows

Cup Bean
1 Robusta
2 Robusta
3 Liberica
4 Liberica
5 Arabica
6 Liberica
7 Arabica
8 Robusta
9 Arabica
10 Liberica
11 Robusta
12 Arabica
TASK Run the R code above, but this time choose a different random seed by choosing a different number in this line of code set.seed(1234). What is your CRD plan? Why might keeping the random seed be important?

Randomised complete block design (RCBD)

Let’s consider a randomized complete block design with one treatment factor (e.g., coffee bean type). If the treatment factor has \(t\) levels there will be \(b\) blocks that each contain \(t\) experimental units resulting in a total of \(t\times b\) experimental units. For example, let’s imagine that for the coffee experiment we had two cup types: mugs and heatproof glasses. We might consider the type of receptacle to have an effect on the coffee strength measured, however, we are not interested in this. Therefore, to negate this we block by cup type. This means that any effect due to the blocking factor (cup type) is accounted for by the blocking.

For a blocked design we want the \(t\) experimental units within each block to be as homogeneous as possible (as similar as possible, so that there is unlikely to be unwanted variation coming into the experiment this way). The variation between blocks (the groups of experimental units) should be large enough (i.e., blocking factors different enough) so that conclusions can be drawn. Allocation of treatments to experimental units is done randomly (i.e., treatments are randomly assigned to units) within each block.

An illustration of a RCBD with one treatment factor, three treatment levels (A, B, & C), and three blocks (rows)

Designing a RCBD using R

Let’s assume you want to set up an experiment similar to the CRD one above; however, now you are in the situation where you have two types of cups (mugs and heatproof glasses). Below we use R to do the random allocation of treatments within each block (cup type) for you.

Here we have \(t = 3\) treatments (bean types) and \(b = 2\) blocks (cup types) so we will have \(t \times b = 6\) experimental units in total.

set.seed(4321) ## this is ONLY for consistency, remove if doing this yourself
plan <- data.frame(Beans = rep(c("Arabica","Liberica", "Robusta"), times = 2), 
           Block = rep(c("Mug", "Glass"), each = 3)) %>% ## combine experiment variables
  group_by(Block) %>% ## group by blocking factor
  dplyr::sample_n(3)
plan
## # A tibble: 6 × 2
## # Groups:   Block [2]
##   Beans    Block
##   <chr>    <chr>
## 1 Arabica  Glass
## 2 Liberica Glass
## 3 Robusta  Glass
## 4 Liberica Mug  
## 5 Arabica  Mug  
## 6 Robusta  Mug

Having run the code above your RCBD plan is as follows

Receptacle 1 2 3 1 2 3
Beans Arabica Liberica Robusta Liberica Arabica Robusta
Block Glass Glass Glass Mug Mug Mug
TASK Run the R code above, but this time choose a different random seed by changing the number in this line of code: set.seed(4321). What is your RCBD plan now? Why might keeping a record of the random seed be important?

Factorial design

A factorial experiment is one where there are two or more sets of (factor) treatments. Rather than studying each factor separately, all combinations of the treatment factors are considered. Factorial designs enable us to infer any interaction effects that may be present. An interaction effect is one where the effect of one variable depends on the value of another variable (i.e., the effect of one treatment factor on the response variable changes depending on the level of a second treatment factor).

Let’s consider Dr Java’s experiment above. Here, Dr Java wants to study all combinations of the levels of each factor. Their objective is to determine what type of grind and brew ratio (i.e., amount of coffee in relation to water) leads to the strongest coffee.
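A quick way to enumerate all the treatment combinations in R is expand.grid(); a minimal sketch using Dr Java's two factors (the level labels here match those used in the edibble design later in this section):

```r
## all combinations of the two treatment factors
expand.grid(BrewRatio = c("2-1", "1-1", "1-2"),
            Grind = c("Fine", "Coarse"))
##   BrewRatio  Grind
## 1       2-1   Fine
## 2       1-1   Fine
## 3       1-2   Fine
## 4       2-1 Coarse
## 5       1-1 Coarse
## 6       1-2 Coarse
```

This gives the \(3 \times 2 = 6\) treatment combinations; each would then be replicated and randomly assigned to experimental units.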

Possible outcomes

Interactions

If an interaction effect exists, the effect of one factor on the response will change depending on the level of the other factor.

The plot above is called an interaction plot. Creating such a plot is often very useful when drawing inference; in this instance we can see that the strength of the coffee changes depending on the type of coffee bean used, however, this relationship differs depending on the type of grind used. For example, Liberica beans produce stronger coffee than the other two beans when the fine grind is used, but weaker coffee when the coarse grind is used.
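If you want to draw such a plot yourself, base R's interaction.plot() is one option; a minimal sketch with made-up numbers (illustrative only, not data from the experiment):

```r
## illustrative data only: coffee strength for each bean-grind combination
toy <- data.frame(Bean = rep(c("Arabica", "Liberica", "Robusta"), times = 2),
                  Grind = rep(c("Fine", "Coarse"), each = 3),
                  Strength = c(5, 8, 6, 6, 4, 5)) ## made-up values
## one trace (line) per grind; non-parallel lines suggest an interaction
with(toy, interaction.plot(x.factor = Bean, trace.factor = Grind,
                           response = Strength, ylab = "Coffee strength"))
```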

Designing factorial experiment using R

Here we use the R package edibble, which is specifically designed to aid in the planning and design of experiments.

## Factorial design (Brew and Grind)
library(edibble) ## experimental design package
design("Barista Brew: Factorial Design") %>%
  set_units(cup = 12) %>% # 3 brew ratios x 2 grinds x 2 replicates = 12 cups
  set_trts(BrewRatio =  c("2-1", "1-1", "1-2"),
           Grind = c("Fine", "Coarse")) %>%
  allot_trts(~cup) %>%
  assign_trts("random", seed = 836) %>% ## the random seed for random allocation
    serve_table()  
## # Barista Brew: Factorial Design 
## # An edibble: 12 x 3
##        cup BrewRatio  Grind
##    <U(12)>    <T(3)> <T(2)>
##      <chr>     <chr>  <chr>
##  1   cup01       1-2   Fine
##  2   cup02       2-1 Coarse
##  3   cup03       1-1   Fine
##  4   cup04       2-1 Coarse
##  5   cup05       1-2 Coarse
##  6   cup06       1-1 Coarse
##  7   cup07       1-2   Fine
##  8   cup08       1-1   Fine
##  9   cup09       1-2 Coarse
## 10   cup10       1-1 Coarse
## 11   cup11       2-1   Fine
## 12   cup12       2-1   Fine
TASK Run the R code above, but this time choose a different random seed (by changing seed = 836). What is your design? Now increase the number of replicates; use your R skills to figure out which line of code needs changing. What is your design now?

Modelling experimental data

A completely randomised design (CRD) as a linear model

As we saw in the previous module, we can write a linear model with a single explanatory variable as

\[Y_i = \alpha + \beta_1x_i + \epsilon_i\]

When dealing with factor variables we use dummy variables and can write the above as

\[Y_{ik} = \alpha + \tau_k + \epsilon_{ik}\] where \(\tau_k\) is called an effect and represents the difference between the overall average, \(\alpha\), and the average at the \(k^{th}\) treatment level. The errors \(\epsilon_{ik}\) are again assumed to be normally distributed and independent due to the randomisation (i.e., \(\epsilon_{ik} \sim N(0, \sigma^2)\)).

Or you might think of the model as

\[Y_{ik} = \mu_k + \epsilon_{ik}\]

where \(Y_{ik}\) is the response (i.e., observed coffee opacity) for the \(i^{th}\) experimental unit (i.e., coffee cup) subjected to the \(k^{th}\) level of the treatment factor (i.e., coffee type). Here \(\mu_k\) are the different (cell) means for each level of the treatment factor. See below for an illustration of this for three factor treatment levels (as in the coffee example above).
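The two parameterisations can be compared directly in R by fitting with and without an intercept; a quick sketch using the rats data introduced in the next section:

```r
## effects parameterisation: intercept (alpha) plus effects (tau_k)
coef(lm(logAUC ~ Surgery, data = rats))
## (Intercept)    SurgeryP    SurgeryS 
##      8.4600      0.4900      3.0875
## cell-means parameterisation: one mean (mu_k) per treatment level
coef(lm(logAUC ~ Surgery - 1, data = rats))
## SurgeryC SurgeryP SurgeryS 
##   8.4600   8.9500  11.5475
```

Note that 8.46 + 0.49 = 8.95 and 8.46 + 3.0875 = 11.5475; the two models give the same fit, just reported differently.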

Analysis of a CRD in R

In this section we will again consider the rats data containing logAUC values for 12 rats subjected to three different treatments (Surgery), C, P, and S:

## Rows: 12
## Columns: 3
## $ Surgery <fct> C, C, C, C, P, P, P, P, S, S, S, S
## $ Rat     <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
## $ logAUC  <dbl> 8.49, 8.20, 9.08, 8.07, 10.24, 7.72, 9.34, 8.50, 11.31, 12.69,…

We could analyse these data using aov():

rats_aov <- aov(logAUC ~ Surgery, data = rats)
summary(rats_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Surgery      2 22.026  11.013   16.36 0.00101 **
## Residuals    9  6.059   0.673                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis: We test the null hypothesis, \(H_0\), that the population (Surgery) means are all the same, versus the alternative hypothesis, \(H_1\), that at least one differs from the others.

The p-value, Pr(>F) = 0.001, is the probability of getting an F-statistic at least as extreme as the one we observe (think of the area under the tail of the F-distribution). It tells us we have sufficient evidence to reject \(H_0\) at the 1% level of significance.

Alternatively, we could use lm():

rats_lm <- lm(logAUC ~ Surgery, data = rats)
summary(rats_lm)$coef
##             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)   8.4600  0.4102531 20.6214144 6.930903e-09
## SurgeryP      0.4900  0.5801856  0.8445574 4.202408e-01
## SurgeryS      3.0875  0.5801856  5.3215734 4.799872e-04

So, what does this tell us about which pairs of means are different?

To carry out pair-wise comparisons of means we can use two-sample t-tests, calculating our observed t-value where \(\text{t-value} = \frac{\text{Sample Difference}_{ij} - \text{Difference assuming } H_0 \text{ is true}_{ij}}{\text{SE of } \text{Sample Difference}_{ij}}\). Here, \(\text{Sample Difference}_{ij}\) is the difference between a pair of sample means. We can then compute the p-value for the observed t-value.

The output above has already done this for us:

(Intercept) = \(\text{mean}_C\) = 8.46

SE of (Intercept) = SE of \(\text{mean}_C\) = SEM = 0.4102531

\(\text{Surgery}_P\) = \(\text{mean}_P\) - \(\text{mean}_C\) = 0.49

SE of \(\text{Surgery}_P\) = SE of (\(\text{mean}_P\) - \(\text{mean}_C\) ) = SED = 0.5801856
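As a check, the t value and p-value in the SurgeryP row can be reproduced from these two numbers:

```r
## t value for the P vs C comparison: difference / SED
0.49 / 0.5801856
## [1] 0.8445574
## two-sided p-value with df = N - m = 12 - 3 = 9
2 * pt(0.8445574, df = 9, lower.tail = FALSE)
## [1] 0.4202408
```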

Hypotheses being tested

The t value and Pr(>|t|) are the t-statistic and p-value for testing the null hypotheses listed below:

  1. Mean abundance is zero for the C population
  2. No difference between the population means of P and C
  3. No difference between the population means of S and C

We’re interested in 2 and 3, but not necessarily 1!

F-test:

anova(rats_lm)
## Analysis of Variance Table
## 
## Response: logAUC
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Surgery    2 22.0263 11.0132  16.359 0.001006 **
## Residuals  9  6.0591  0.6732                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is the same as aov(); in fact, aov() calls lm() in the background.

Diagnostic plots

When carrying out any linear regression, recall that we have some key assumptions:

  • Independence
  • There is a linear relationship between the response and the explanatory variables
  • The residuals have constant variance
  • The residuals are normally distributed

TASK Check and comment on the assumptions of this model using the code below.

library(gglm) ## provides gglm() diagnostic plots
gglm(rats_lm) # Plot the four main diagnostic plots

A Factorial experiment (as a CRD)

Equal replications (balanced design)

The factorial data contain observations of global metabolic profiling and comparison of relative abundances of proteins (logAUC) in the inner and outer left ventricle (innerLV and outerLV) wall of diabetic and healthy male Wistar rats:

glimpse(factorial)
Organ     Diabetic  Healthy
innerLV   n = 3     n = 3
outerLV   n = 3     n = 3

Fitting models with interactions using lm()

## change to factors (saves errors later on)
factorial$Disease <- as.factor(factorial$Disease)
factorial$Organ <- as.factor(factorial$Organ)
fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
summary(fac_lm)$coefficients
##                              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                  7.873333  0.5405835 14.564508 4.841215e-07
## DiseaseHealthy               1.950000  0.7645006  2.550685 3.413826e-02
## OrganouterLV                 2.116667  0.7645006  2.768692 2.434579e-02
## DiseaseHealthy:OrganouterLV -1.840000  1.0811671 -1.701865 1.271934e-01

So, the full model is

\[ \begin{aligned} \operatorname{\widehat{logAUC}} &= 7.87 + 1.95(\operatorname{Disease}_{\operatorname{Healthy}}) + 2.12(\operatorname{Organ}_{\operatorname{outerLV}}) - 1.84(\operatorname{Disease}_{\operatorname{Healthy}} \times \operatorname{Organ}_{\operatorname{outerLV}}) \end{aligned} \]

The three global null hypotheses being tested are

  1. \(H_0: \hat{\mu}_{\text{Diabetic}} = \hat{\mu}_{\text{Healthy}}\)
  2. \(H_0: \hat{\mu}_{\text{innerLV}} = \hat{\mu}_{\text{outerLV}}\)
  3. \(H_0: \hat{\mu}_{\text{Diabetic,innerLV}} = \hat{\mu}_{\text{Diabetic,outerLV}} = \hat{\mu}_{\text{Healthy,innerLV}} = \hat{\mu}_{\text{Healthy,outerLV}}\)
anova(fac_lm)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Disease        1 3.1827  3.1827  3.6304 0.09320 .
## Organ          1 4.2960  4.2960  4.9003 0.05775 .
## Disease:Organ  1 2.5392  2.5392  2.8963 0.12719  
## Residuals      8 7.0135  0.8767                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TASK Using the outputs above state your inference regarding the listed hypotheses.

Note that with a balanced design the ordering of the terms doesn’t matter. For example,

fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
anova(fac_lm)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Disease        1 3.1827  3.1827  3.6304 0.09320 .
## Organ          1 4.2960  4.2960  4.9003 0.05775 .
## Disease:Organ  1 2.5392  2.5392  2.8963 0.12719  
## Residuals      8 7.0135  0.8767                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = factorial)
anova(fac_lm_2)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Organ          1 4.2960  4.2960  4.9003 0.05775 .
## Disease        1 3.1827  3.1827  3.6304 0.09320 .
## Organ:Disease  1 2.5392  2.5392  2.8963 0.12719  
## Residuals      8 7.0135  0.8767                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unequal replications (unbalanced design)

Here, we will consider a subset of the data above (i.e., an artificially unbalanced study).

unbalanced <- factorial[- c(1:3,10), ]
glimpse(unbalanced)
## Rows: 8
## Columns: 5
## $ Disease <fct> Healthy, Healthy, Healthy, Diabetic, Diabetic, Diabetic, Diabe…
## $ Organ   <fct> outerLV, innerLV, outerLV, innerLV, outerLV, innerLV, innerLV,…
## $ Animal  <dbl> 4, 5, 6, 7, 8, 9, 11, 12
## $ Sample  <dbl> 2, 1, 2, 1, 2, 1, 1, 2
## $ logAUC  <dbl> 10.49, 9.74, 10.98, 7.92, 9.37, 8.69, 7.01, 9.29

Fitting models with interactions using lm()

Note that with an unbalanced design the ordering of the terms DOES matter. For example,

fac_lm <- lm(logAUC ~ Disease*Organ, data = unbalanced)
anova(fac_lm)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Disease        1 7.1102  7.1102 18.4955 0.01264 *
## Organ          1 3.1149  3.1149  8.1027 0.04656 *
## Disease:Organ  1 0.0913  0.0913  0.2376 0.65145  
## Residuals      4 1.5377  0.3844                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = unbalanced)
anova(fac_lm_2)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Organ          1 5.7291  5.7291 14.9029 0.01814 *
## Disease        1 4.4960  4.4960 11.6953 0.02678 *
## Organ:Disease  1 0.0913  0.0913  0.2376 0.65145  
## Residuals      4 1.5377  0.3844                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The three global null hypotheses being tested are (essentially) the same

  1. \(H_0: \hat{\mu}_{\text{Diabetic}} = \hat{\mu}_{\text{Healthy}}\)
  2. \(H_0: \hat{\mu}_{\text{innerLV}} = \hat{\mu}_{\text{outerLV}}\)
  3. \(H_0: \hat{\mu}_{\text{Diabetic,innerLV}} = \hat{\mu}_{\text{Diabetic,outerLV}} = \hat{\mu}_{\text{Healthy,innerLV}} = \hat{\mu}_{\text{Healthy,outerLV}}\)

However, now the order of the terms affects the estimation. Look carefully at the anova() outputs above; what do you notice about the sums of squares (Sum Sq) values?

For a balanced two-factor experiment (e.g., in the previous section) partitioning of the sums of squares (\(SS\)) is additive:

\[SS_\text{Treatment} = SS_\text{Disease} + SS_\text{Organ} + SS_\text{Disease:Organ}.\]

The order in which the main effects are added to the model does not matter. However, for an unbalanced design this is not the case as the sums of squares are calculated sequentially. Note that sequentially calculated \(SS\) are known as Type I \(SS\).

Sums of squares (\(SS\))

Sequential (Type I \(SS\))

As a term enters the model its \(SS\) is calculated, which is then subtracted from the total \(SS\). This then reduces the available \(SS\) for the next term entering the model. When treatment combinations in a factorial experiment are unequally replicated, their effects are not mutually independent, so that the order in which terms enter the model matters.

Considering the data above, if we include Disease as a main effect first (i.e., Disease*Organ) then \(SS_\text{Disease}\) is calculated first, ignoring the Organ main effect. Here, some Organ information is confounded with Disease information (i.e., variation due to Organ is confounded with variation due to Disease). Then \(SS_\text{Organ}\) is calculated having been adjusted for the Disease main effect. This now only contains Organ information (i.e., variation due to the Organ effect), since all the Disease information was eliminated in the previous step. Finally, \(SS_\text{Disease:Organ}\) is calculated adjusted for both \(SS_\text{Disease}\) and \(SS_\text{Organ}\). Here, there is no information left relating to the Disease or Organ main effects.

What does this look like?

For Disease*Organ we calculate \(SS_\text{Disease}\) ignoring any Organ effect (if you are unsure what each line of code is doing I suggest you run it line by line to see what’s being done at each step).

unbalanced %>% 
  mutate(grand_mean = mean(logAUC)) %>%
  group_by(Disease) %>%
  summarise(n = n(),
         treatment_mean = mean(logAUC),
         grand_mean = mean(grand_mean)) %>%
  mutate(ss_disease = n * (treatment_mean - grand_mean)^2) %>%
  pull(ss_disease) %>%
  sum()
## [1] 7.110201

This matches, as we’d expect, \(SS_\text{Disease}\) calculated from the Disease*Organ model above.

But what about the sequential \(SS\)? A simple way to think about this is in terms of sequential models:

## Null model.
fit.null <- lm(logAUC ~ 1,  data = unbalanced)
## Only Factor Disease.
fit.a <- lm(logAUC ~ Disease,  data = unbalanced)
## Factors Disease and Organ.
fit.ab <- lm(logAUC ~ Disease + Organ,  data = unbalanced)
## Factors Disease and Organ with interaction.
fit.abi <- lm(logAUC ~ Disease*Organ,  data = unbalanced)
## ANOVA table, as above
anova(fit.abi)
## Analysis of Variance Table
## 
## Response: logAUC
##               Df Sum Sq Mean Sq F value  Pr(>F)  
## Disease        1 7.1102  7.1102 18.4955 0.01264 *
## Organ          1 3.1149  3.1149  8.1027 0.04656 *
## Disease:Organ  1 0.0913  0.0913  0.2376 0.65145  
## Residuals      4 1.5377  0.3844                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## First line.
sum(residuals(fit.null)^2) - sum(residuals(fit.a)^2)
## [1] 7.110201
## Second line.
sum(residuals(fit.a)^2) - sum(residuals(fit.ab)^2)
## [1] 3.114926
## Third line.
sum(residuals(fit.ab)^2) - sum(residuals(fit.abi)^2)
## [1] 0.09134405
## Final line.
sum(residuals(fit.abi)^2)
## [1] 1.537717

Type II \(SS\)

Rather than calculating \(SS\) sequentially we can calculate the \(SS\) for a given effect adjusting for all other effects listed in the model. This means that the \(SS_\text{A}\) and \(SS_\text{B}\) main effects will both be adjusted for each other (since neither contains the other), but will not be adjusted for \(SS_\text{A:B}\) (since it contains both A and B). \(SS_\text{A:B}\) will be adjusted for both main effects.

In our example above, \(SS_\text{Disease}\) and \(SS_\text{Organ}\) will both be adjusted for each other but not for \(SS_\text{Disease:Organ}\), while \(SS_\text{Disease:Organ}\) will be adjusted for both main effects.

We can calculate the Type II \(SS\) table in R by either

  1. Extracting the main effect rows that have been adjusted for the other model terms from two Type I \(SS\) tables (each one having the terms specified in a different order):
## Type II Organ SS
anova(fac_lm)[2, ]
## Analysis of Variance Table
## 
## Response: logAUC
##       Df Sum Sq Mean Sq F value  Pr(>F)  
## Organ  1 3.1149  3.1149  8.1027 0.04656 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Disease SS
anova(fac_lm_2)[2, ]
## Analysis of Variance Table
## 
## Response: logAUC
##         Df Sum Sq Mean Sq F value  Pr(>F)  
## Disease  1  4.496   4.496  11.695 0.02678 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Organ:Disease/Disease:Organ
anova(fac_lm_2)[3, ]
## Analysis of Variance Table
## 
## Response: logAUC
##               Df   Sum Sq  Mean Sq F value Pr(>F)
## Organ:Disease  1 0.091344 0.091344  0.2376 0.6514

or,

  2. Using an inbuilt R function (it does not matter in which order the model terms are specified):
car::Anova(fac_lm, type = 2)
## Anova Table (Type II tests)
## 
## Response: logAUC
##               Sum Sq Df F value  Pr(>F)  
## Disease       4.4960  1 11.6953 0.02678 *
## Organ         3.1149  1  8.1027 0.04656 *
## Disease:Organ 0.0913  1  0.2376 0.65145  
## Residuals     1.5377  4                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: Type III \(SS\) and beyond are not covered in this course but I would recommend reading this recent paper for an intuitive overview.

Marginal means

Balanced design

logAUC data from a balanced design
Disease Organ logAUC
Healthy innerLV 9.40
Healthy outerLV 8.83
Healthy innerLV 10.33
Healthy outerLV 10.49
Healthy innerLV 9.74
Healthy outerLV 10.98
Diabetic innerLV 7.92
Diabetic outerLV 9.37
Diabetic innerLV 8.69
Diabetic outerLV 11.31
Diabetic innerLV 7.01
Diabetic outerLV 9.29
Grand mean: 9.447
Cell means
Disease Organ log AUC mean
Diabetic innerLV 7.873
Diabetic outerLV 9.990
Healthy innerLV 9.823
Healthy outerLV 10.100
Marginal (organ) means
Organ log AUC mean
innerLV 8.848
outerLV 10.045
Marginal (disease) means
Disease log AUC mean
Diabetic 8.932
Healthy 9.962

Unbalanced design (Unequal replication due to missing data)

logAUC data from an unbalanced design
Disease Organ logAUC
Healthy outerLV 10.49
Healthy innerLV 9.74
Healthy outerLV 10.98
Diabetic innerLV 7.92
Diabetic outerLV 9.37
Diabetic innerLV 8.69
Diabetic innerLV 7.01
Diabetic outerLV 9.29
Grand mean: 9.186
Cell means
Disease Organ log AUC mean
Diabetic innerLV 7.873
Diabetic outerLV 9.330
Healthy innerLV 9.740
Healthy outerLV 10.735

Everything is as we’d expect up until now, but what about the marginal means? Perhaps the most obvious way would be to do the following (i.e., ignore the subgroups, and hence give all observations equal weight):

unbalanced %>%
   dplyr::select(c(Disease, logAUC)) %>%
   group_by(Disease) %>%
   summarise(Mean = mean(logAUC))
## # A tibble: 2 × 2
##   Disease   Mean
##   <fct>    <dbl>
## 1 Diabetic  8.46
## 2 Healthy  10.4

However, as a result of this the means are biased towards groups with greater replication! To avoid this we give the subgroup (Organ) cell means equal weight (this is sometimes called the least squares mean):

unbalanced %>%
   dplyr::select(c(Disease, Organ, logAUC)) %>%
   group_by(Organ, Disease) %>%
     mutate(n = n()) %>% ## count observations in each group
    ## then calculate weighted mean based on the no. observations
   summarise(weighted_mean = weighted.mean(logAUC, w = 1/n)) %>%
     group_by(Disease) %>% ## calculate mean of weighted means
     summarise(mean = mean(weighted_mean))
## # A tibble: 2 × 2
##   Disease   mean
##   <fct>    <dbl>
## 1 Diabetic  8.60
## 2 Healthy  10.2

Now see if you can do the same across Organ to get

Marginal (organ) means
Organ mean
innerLV 8.807
outerLV 10.032

What does all of this tell us? When calculating marginal means for unbalanced designs we need to be careful! Luckily there are inbuilt functions in R that do this correctly for us! See the next section for an example using the predictmeans function from the predictmeans package.

Multiple comparisons

Recall that each time we carry out a hypothesis test the probability we get a false positive result (type I error) is given by \(\alpha\) (the level of significance we choose).

When we have multiple comparisons to make we should then control the Type I error rate across the entire family of tests under consideration, i.e., control the Family-Wise Error Rate (FWER); this ensures that the risk of making at least one Type I error among the family of comparisons in the experiment is \(\alpha\).

State of Nature     Don’t reject \(H_0\)   Reject \(H_0\)
\(H_0\) is true     Correct decision       Type I error
\(H_0\) is false    Type II error          Correct decision

The familywise error rate (FWER) is the risk of making at least one Type I error among the family of comparisons in the experiment. Now consider carrying out \(m\) independent t-tests and, for any single test, let Pr(commit a Type I error) \(= \alpha_c\) be the per comparison error rate (PCER). So for a single test the probability a correct decision is made is \(1 - \alpha_c\). Therefore, for \(m\) independent t-tests the probability of committing no Type I errors is \((1 - \alpha_c)^m\) and the probability of committing at least one Type I error is \(1 -(1 - \alpha_c)^m = \alpha_F\), which is the upper limit of the FWER.
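To see how quickly this error rate grows with the number of tests, we can compute \(\alpha_F\) directly for \(\alpha_c = 0.05\):

```r
## upper limit of the FWER for m independent tests at alpha_c = 0.05
m <- c(1, 3, 10)
1 - (1 - 0.05)^m
## [1] 0.0500000 0.1426250 0.4012631
```

So with just ten independent tests there is roughly a 40% chance of at least one false positive.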

The False Discovery Rate (FDR) controls the expected (mean) proportion of false discoveries among the \(R\) (out of \(m\)) hypotheses declared significant.

Consider testing \(m\) null hypotheses with corresponding p-values \(P_1, P_2,...,P_m\); we then order them so that \(P_{(1)} < P_{(2)} <...<P_{(m)}\) (where \(P_{(i)}\) is the \(i^{th}\) smallest, \(i=1,...,m\)). The \(i^{th}\) ordered p-value is compared with the threshold \(\frac{i}{m}q^*\), and the null hypotheses corresponding to \(P_{(1)},...,P_{(k)}\) are rejected, where \(k\) is the largest \(i\) for which \(P_{(i)} \leq \frac{i}{m}q^*\).
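In R this adjustment (the Benjamini-Hochberg procedure) is available via p.adjust(); a sketch with some illustrative p-values (not from the rats example):

```r
## illustrative p-values only
p <- c(0.001, 0.012, 0.031, 0.040, 0.200)
## Benjamini-Hochberg adjusted p-values; compare these to q* (e.g., 0.05)
p.adjust(p, method = "BH")
## [1] 0.005 0.030 0.050 0.050 0.200
```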

Adjustments for multiple testing

Calculating 95% confidence intervals for pairwise comparisons is similar to the procedure discussed in module 2:

\[\text{95% CI} = \text{estimate} \pm (\text{scale factor} \times \text{standard error}_\text{of estimate}).\] For the 95% CI for pairwise comparisons of means this becomes

\[\text{95% CI} = \text{difference} \pm (\text{scale factor} \times \text{SED}),\] where SED is the standard error of the difference (which depends on group replication). The choice of scale factor depends on what adjustment we might want to make. Below we cover three common choices of (scale factor × SED): 1) Fisher’s LSD, 2) the Bonferroni-corrected LSD, and 3) Tukey’s HSD.

Fisher’s Least Significant Difference (LSD)

To calculate the LSD the \(t\)-distribution is used, specifically the \[t_{\alpha = \frac{\alpha_c}{2}, \text{df} = N - m}\] distribution, where \(N\) is the number of observations and \(m\) is the number of treatment groups. To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.

qt(p = 0.05/2,df = 12 - 3, lower.tail = FALSE)
## [1] 2.262157

Fisher’s LSD is then \[t_{\alpha = \frac{\alpha_c}{2}, \text{df} = N - m} \times \text{SED}.\]

Here we carry out post-hoc tests only if the ANOVA F-test is significant. If so, we declare significant (at the \(100\alpha\%\) level) any pairwise difference greater than the LSD. Note that this does not control the FWER.
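For the rats example, multiplying this critical value by the SED from the earlier lm() output gives the LSD itself:

```r
## Fisher's LSD: critical t value times the SED
qt(p = 0.05/2, df = 12 - 3, lower.tail = FALSE) * 0.5801856
## [1] 1.312471
```

So any pair of Surgery means further apart than about 1.31 would be declared different at the 5% level.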

Bonferroni correction

To calculate Bonferroni correction the \(t\)-distribution is used, specifically the \[t_{\alpha = \frac{\alpha_c}{2 \times k}, \text{df} = N - m}\] distribution, where \(N\) is the number of observations, \(m\) is the number of treatment groups, and \(k = {m \choose 2}\) is the number of pairwise comparisons being made.

To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.

qt(p = 0.05/(2 * choose(3,2)),df = 12 - 3, lower.tail = FALSE)
## [1] 2.933324

Bonferroni’s LSD is then \[t_{\alpha = \frac{\alpha_c}{2 \times k}, \text{df} = N - m} \times \text{SED}.\]

Here we reject any \(H_0\) for which the p-value, p-val, satisfies p-val \(< \alpha_c = \frac{\alpha_f}{n_c}\), where \(\alpha_f\) is the FWER and \(n_c\) is the number of pairwise comparisons. Note that this procedure makes no assumption of independence between tests.
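Equivalently, we can adjust the raw p-values rather than the significance level; a sketch using p.adjust() with the three unadjusted pairwise p-values from the rats example:

```r
## Bonferroni adjustment multiplies each p-value by the number of comparisons
## (capped at 1); raw values are for the C-P, C-S, and P-S comparisons
p.adjust(c(0.4202, 0.0005, 0.0015), method = "bonferroni")
## [1] 1.0000 0.0015 0.0045
```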

Tukey’s Honest Significant Difference (HSD)

This compares the mean of every treatment with the mean of every other treatment, and uses the studentised range distribution rather than the \(t\)-distribution used for Fisher’s LSD and the Bonferroni correction. To calculate the HSD, the \[q_{ 1- \alpha_c, m, \text{df} = N - m}\] quantile of the studentised range distribution is used, where \(N\) is the number of observations and \(m\) is the number of treatment groups.

To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.

qtukey(p = 1 - 0.05, nmeans = 3, df = 12 - 3)
## [1] 3.948492

Tukey’s HSD is given by \[\frac{q_{ 1 - \alpha_c, m, \text{df} = N - m}}{\sqrt{2}} \times \sqrt{\frac{2\hat{\sigma}^2}{n}}.\] Here \(\hat{\sigma}^2\) is the residual mean square error and \(n\) is the assumed equal number of replicates in each group.
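For the rats example (\(m = 3\), \(N = 12\), \(n = 4\), \(\hat{\sigma}^2 = 0.6732\)) this works out to roughly 1.62, noticeably wider than Fisher’s LSD of about 1.31:

```r
## Tukey's HSD for the rats example
q_crit <- qtukey(p = 1 - 0.05, nmeans = 3, df = 12 - 3)
q_crit / sqrt(2) * sqrt(2 * 0.6732 / 4)
```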

Classification of multiple hypothesis tests

Suppose we have \(m\) null hypotheses, \(H_1, H_2, ..., H_m\). Using the traditional parlance, we reject a null hypothesis if the test is declared significant and do not reject it if the test is “not significant”. Now, summing each type of outcome over all \(H_i\ (i = 1,...,m)\) yields the following random variables:

                                  Null hypothesis true (\(H_0\))   Alternative hypothesis true (\(H_A\))   Total
Test is declared significant      V                                S                                       R
Test is declared non-significant  U                                T                                       m - R
Total                             \(m_{0}\)                        \(m - m_0\)                             m
  • \(m\) is the total number of hypotheses tested
  • \(m_{0}\) is the number of true null hypotheses, an unknown parameter
  • \(m - m_0\) is the number of true alternative hypotheses
  • \(V\) is the number of false positives (Type I error) (also called false discoveries)
  • \(S\) is the number of true positives (also called true discoveries)
  • \(T\) is the number of false negatives (Type II error)
  • \(U\) is the number of true negatives
  • \(R=V+S\) is the number of rejected null hypotheses (also called discoveries, either true or false)

Multiple comparison procedures

Recall the CRD rats data:

glimpse(rats)
## Rows: 12
## Columns: 3
## $ Surgery <fct> C, C, C, C, P, P, P, P, S, S, S, S
## $ Rat     <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
## $ logAUC  <dbl> 8.49, 8.20, 9.08, 8.07, 10.24, 7.72, 9.34, 8.50, 11.31, 12.69,…

To use predictmeans later on we have to ensure that the relevant variables are coded as factors:

rats <- rats %>%
  mutate(Surgery = as.factor(Surgery))

We can fit a linear model using lm():

rats_lm <- lm(logAUC ~ Surgery, data = rats)
coef(rats_lm)
## (Intercept)    SurgeryP    SurgeryS 
##      8.4600      0.4900      3.0875

Our fitted model is therefore

\[ \begin{aligned} \operatorname{\widehat{logAUC}} &= 8.46 + 0.49(\operatorname{Surgery}_{\operatorname{P}}) + 3.09(\operatorname{Surgery}_{\operatorname{S}}) \end{aligned} \]

Using predictmeans from the predictmeans package

By default Fisher’s comparisons are made.

pm <- predictmeans(rats_lm , modelterm = "Surgery", 
                                     pairwise = TRUE, plot = FALSE)
str(pm)
## List of 10
##  $ Predicted Means              : 'xtabs' num [1:3(1d)] 8.46 8.95 11.55
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ Surgery: chr [1:3] "C" "P" "S"
##   ..- attr(*, "call")= language xtabs(formula = pm ~ ., data = mt[, c("pm", vars)], drop.unused.levels = TRUE)
##  $ Standard Error of Means      : Named num 0.41
##   ..- attr(*, "names")= chr "All means have the same SE"
##  $ Standard Error of Differences: Named num [1:3] 0.58 0.58 0.58
##   ..- attr(*, "names")= chr [1:3] "Max.SED" "Min.SED" "Aveg.SED"
##  $ LSD                          : Named num [1:3] 1.31 1.31 1.31
##   ..- attr(*, "names")= chr [1:3] "Max.LSD" "Min.LSD" "Aveg.LSD"
##   ..- attr(*, "Significant level")= num 0.05
##   ..- attr(*, "Degree of freedom")= num 9
##  $ Pairwise LSDs                : num [1:3, 1:3] 0 1.31 1.31 -0.49 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "C" "P" "S"
##   .. ..$ : chr [1:3] "C" "P" "S"
##   ..- attr(*, "Significant level")= num 0.05
##   ..- attr(*, "Degree of freedom")= int 9
##   ..- attr(*, "Note")= chr "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"
##  $ Pairwise p-value             : num [1:3, 1:3] 1 0.4202 0.0005 -0.8446 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "C" "P" "S"
##   .. ..$ : chr [1:3] "C" "P" "S"
##   ..- attr(*, "Degree of freedom")= int 9
##   ..- attr(*, "Note")= chr "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"
##  $ predictmeansPlot             :List of 2
##   ..$ meanPlot: NULL
##   ..$ ciPlot  : NULL
##  $ predictmeansBarPlot          : NULL
##  $ mean_table                   :List of 2
##   ..$ Table:'data.frame':    3 obs. of  7 variables:
##   .. ..$ Surgery  : Factor w/ 3 levels "C","P","S": 1 2 3
##   .. ..$ Mean     : num [1:3] 8.46 8.95 11.55
##   .. ..$ SE       : num [1:3] 0.41 0.41 0.41
##   .. ..$ Df       : num [1:3] 9 9 9
##   .. ..$ LL(95%)  : num [1:3] 7.53 8.02 10.62
##   .. ..$ UL(95%)  : num [1:3] 9.39 9.88 12.48
##   .. ..$ LetterGrp: chr [1:3] "A " "A " " B"
##   ..$ Note : chr "Letter-based representation of pairwise comparisons at significant level '0.05'"
##  $ p_valueMatrix                : num [1:3, 1:3] 0 0.4202 0.0005 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "C" "P" "S"
##   .. ..$ : chr [1:3] "C" "P" "S"
##  - attr(*, "class")= chr "pdmlist"

Let’s look at some of the pm elements more closely:

pm$mean_table
## $Table
##   Surgery    Mean      SE Df LL(95%) UL(95%) LetterGrp
## 1       C  8.4600 0.41025  9  7.5319  9.3881        A 
## 2       P  8.9500 0.41025  9  8.0219  9.8781        A 
## 3       S 11.5475 0.41025  9 10.6194 12.4756         B
## 
## $Note
## [1] "Letter-based representation of pairwise comparisons at significant level '0.05'"

This gives us the estimated group means and associated 95% CIs. Why are the standard errors all equal?
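A hint for the question above: every Surgery group has the same replication (\(n = 4\)), so every mean has the same standard error, \(\text{SEM} = \sqrt{\hat{\sigma}^2/n}\). A quick check using the residual mean square from the earlier ANOVA table:

```r
## SEM = sqrt(residual mean square / group replication)
sqrt(0.6732 / 4)
## [1] 0.4102438
```

This matches the SE column above up to rounding of the mean square.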

But, what we’d like is the pairwise comparisons between groups. Information pertaining to this is also returned:

print(pm$`Pairwise p-value`)
##        C       P       S
## C 1.0000 -0.8446 -5.3216
## P 0.4202  1.0000 -4.4770
## S 0.0005  0.0015  1.0000
## attr(,"Degree of freedom")
## [1] 9
## attr(,"Note")
## [1] "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"

This gives us the pairwise comparison statistic (the \(t\)-statistic in this case) above the diagonal and the associated p-values below the diagonal.

What about the 95% CIs for the comparisons? Luckily predictmeans also returns the pairwise LSD values (Fisher's by default with \(\alpha_c = 0.05\)):

pm$`Pairwise LSDs`
##         C        P       S
## C 0.00000 -0.49000 -3.0875
## P 1.31247  0.00000 -2.5975
## S 1.31247  1.31247  0.0000
## attr(,"Significant level")
## [1] 0.05
## attr(,"Degree of freedom")
## [1] 9
## attr(,"Note")
## [1] "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"

Here, the upper diagonal matrix has the pairwise differences and the lower has the LSD values.

Fisher’s LSD

So, we have all the information to construct the Fisher LSD 95% CIs! You could extract all the information and construct a pairwise comparison table manually, or use a pre-written function (aren’t I nice to provide one!):

url <- "https://gist.github.com/cmjt/72f3941533a6bdad0701928cc2924b90"
devtools::source_gist(url, quiet = TRUE)
comparisons(pm)
##   Comparison Difference  SED   LSD    lwr    upr      t      p
## 1        C-P     -0.490 0.58 1.312 -1.802  0.822 -0.845 0.4202
## 2        C-S     -3.087 0.58 1.312 -4.400 -1.775 -5.322 0.0005
## 3        P-S     -2.598 0.58 1.312 -3.910 -1.285 -4.477 0.0015

Bonferroni correction

To use the Bonferroni correction we must now calculate and specify the adjusted \(\alpha_b = \frac{\alpha}{n_c}\); in our case this is \(\frac{0.05}{{3 \choose 2}}\). We also specify adj = "bonferroni":

alpha.adj <- 0.05/choose(3,2)
bonferroni <-  predictmeans::predictmeans(rats_lm , 
                                          modelterm = "Surgery", adj = "bonferroni",
                                          level = alpha.adj,
                                          pairwise = TRUE, plot = FALSE)
comparisons(bonferroni)

Tukey’s Honest Significant Difference (HSD)

Things are a bit more cumbersome when it comes to Tukey’s HSD. We can specify adj = "tukey" in predictmeans:

tukey <-  predictmeans::predictmeans(rats_lm , 
                                          modelterm = "Surgery", adj = "tukey",
                                          level = 0.05,
                                          pairwise = TRUE, plot = FALSE)
names(tukey)

However, the HSD values are not returned, so we cannot calculate our pairwise CIs. R has an inbuilt function TukeyHSD() that computes the pairwise CIs for us; as we can see below, though, the HSD values themselves are still not returned.

TukeyHSD(aov(logAUC~Surgery, data = rats))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = logAUC ~ Surgery, data = rats)
## 
## $Surgery
##       diff        lwr      upr     p adj
## P-C 0.4900 -1.1298813 2.109881 0.6863267
## S-C 3.0875  1.4676187 4.707381 0.0012479
## S-P 2.5975  0.9776187 4.217381 0.0039400

From above we saw that Tukey’s HSD was given by

\[\frac{q_{1-\alpha_c, m, \text{df} = N - m}}{\sqrt{2}} \times \sqrt{\frac{2\hat{\sigma}^2}{n}}.\]

We can calculate \(q_{1 - \alpha_c, m, \text{df} = N - m}\) for our data using

qtukey(p = 1- 0.05, nmeans = 3, df = 12 - 3)
## [1] 3.948492

Assuming equal replicates (\(n = 4\), with \(\hat{\sigma}^2\) as above) we calculate the SED as

sqrt(2 * anova(rats_lm)[2,3] / 4)
## [1] 0.5801856

Therefore, Tukey’s HSD is \(\frac{3.9484922}{\sqrt{2}}\times 0.5801856 = 1.6198813.\)

To calculate the CIs we use \(\text{difference} \pm \text{HSD}\):

HSD <- (qtukey(p = 1 - 0.05, nmeans = 3, df = 12 - 3)/sqrt(2))*sqrt(2 * anova(rats_lm)[2,3] / 4)
all_diffs <- outer(tukey$`Predicted Means`, tukey$`Predicted Means`, "-")
comparison_table <- data.frame(differences = all_diffs[lower.tri(all_diffs)]) %>%
  mutate(upper = differences + HSD, lower = differences - HSD, HSD = HSD)
         
comparison_table

This matches the output from TukeyHSD() above!

What about the values returned by predictmeans()?

print(tukey$`Pairwise p-value`)

There is an easier way to see the same pairwise contrasts using the emmeans function from the package of the same name and the pairs function:

em <- emmeans::emmeans(rats_lm, specs =  "Surgery")
pairs(em, adjust = "tukey")
##  contrast estimate   SE df t.ratio p.value
##  C - P       -0.49 0.58  9  -0.845  0.6863
##  C - S       -3.09 0.58  9  -5.322  0.0012
##  P - S       -2.60 0.58  9  -4.477  0.0039
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

This gives us the same information as predictmeans but in an easier-to-read format (still no CIs though; we still have to calculate those ourselves!)

However, the emmeans package facilitates some nice plotting of the pairwise comparisons:

plot(em, comparisons = TRUE) + theme_bw()

Linear mixed-effect models (LMMs)

Recall, blocking helps control variability by making treatment groups more alike. Experimental units are divided into subsets (called blocks) so that units within the same block are more similar than units from different subsets or blocks. Blocking is a technique for dealing with nuisance factors. A nuisance factor is a factor that has some effect on the response, but is of no interest (e.g., age class).

Fixed effects are terms (parameters) in a statistical model which are fixed, or non-random, quantities (e.g., treatment group’s mean response). For the same treatment, we expect this quantity to be the same from experiment to experiment.

Random effects are terms (parameters) in a statistical model which are considered as random quantities or variables (e.g., block id). Specifically, terms whose levels are a representative sample from a population, and where the variance of the population is of interest should be allocated as random. Setting a block as a random effect allows us to infer variation between blocks as well as the variation between experimental units within blocks.

Key idea: Partition known sources of variation which are unimportant to key scientific question(s) to improve precision of comparisons between treatment means.

A Randomised Complete Block Design (RCBD)

The rcbd data:

glimpse(rcbd)

To use predictmeans later on we have to ensure that the relevant variables are coded as factors:

rcbd <- rcbd %>%
  mutate(Run = as.factor(Run)) %>%
  mutate(Surgery = as.factor(Surgery))

Run as a fixed effect

lm <- lm(logAUC8 ~ Run + Surgery, data = rcbd)
summary(lm)
## 
## Call:
## lm(formula = logAUC8 ~ Run + Surgery, data = rcbd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7517 -0.3683 -0.0900  0.4508  1.5817 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.6583     0.8506   9.004 0.000105 ***
## Run2         -2.0167     0.9822  -2.053 0.085854 .  
## Run3          0.4100     0.9822   0.417 0.690882    
## Run4          1.2933     0.9822   1.317 0.235963    
## SurgeryP      1.9750     0.8506   2.322 0.059293 .  
## SurgeryS      3.8500     0.8506   4.526 0.003991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.203 on 6 degrees of freedom
## Multiple R-squared:  0.8449, Adjusted R-squared:  0.7157 
## F-statistic: 6.538 on 5 and 6 DF,  p-value: 0.02034

Run as a random effect

Note Both the lmerTest and lme4 packages export a function called lmer, which fits linear mixed effect models. For all estimation purposes etc. they essentially do the same thing. However, the structure of the output differs and this MATTERS when we later come to use other functions on the outputs. Therefore, I take care below to specify which version of the lmer function I use by explicitly calling it from either lmerTest or lme4 using the :: syntax.

There are, confusingly, two ways of fitting the same model. For inference we require both!

Option 1 uses the lmer function from the lme4 package:

lmer4_mod <- lme4::lmer(logAUC8 ~ Surgery + (1|Run), data = rcbd)
summary(lmer4_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: logAUC8 ~ Surgery + (1 | Run)
##    Data: rcbd
## 
## REML criterion at convergence: 37.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8525 -0.2273  0.1772  0.4036  1.3309 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Run      (Intercept) 1.479    1.216   
##  Residual             1.447    1.203   
## Number of obs: 12, groups:  Run, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   7.5800     0.8552   8.863
## SurgeryP      1.9750     0.8506   2.322
## SurgeryS      3.8500     0.8506   4.526
## 
## Correlation of Fixed Effects:
##          (Intr) SrgryP
## SurgeryP -0.497       
## SurgeryS -0.497  0.500

Option 2 uses the lmer function from the lmerTest package:

lmerTest_mod <- lmerTest::lmer(logAUC8 ~ Surgery + (1|Run), data = rcbd)
summary(lmerTest_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logAUC8 ~ Surgery + (1 | Run)
##    Data: rcbd
## 
## REML criterion at convergence: 37.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8525 -0.2273  0.1772  0.4036  1.3309 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Run      (Intercept) 1.479    1.216   
##  Residual             1.447    1.203   
## Number of obs: 12, groups:  Run, 4
## 
## Fixed effects:
##             Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)   7.5800     0.8552 5.9567   8.863 0.000119 ***
## SurgeryP      1.9750     0.8506 6.0000   2.322 0.059293 .  
## SurgeryS      3.8500     0.8506 6.0000   4.526 0.003991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) SrgryP
## SurgeryP -0.497       
## SurgeryS -0.497  0.500

As you can see the estimates are identical; the lmerTest version, however, also reports denominator degrees of freedom and p-values for the fixed effects (via Satterthwaite's method). Why bother with both, you might ask?! This will become apparent later on.

Inference about the random effects

We have two variance components

  • Between Groups (Runs) \(\hat{\sigma}^2_{\text{Run}}\) = 1.479
  • Within Runs (between observations) \(\hat{\sigma}^2\) = 1.447

Note that aov() presents the same information, but in a different way:

summary(aov(logAUC8 ~ Surgery + Error(Run), data = rcbd))
## 
## Error: Run
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  3  17.65   5.883               
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Surgery    2 29.652  14.826   10.25 0.0116 *
## Residuals  6  8.682   1.447                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Within Runs (Residuals) \(\hat{\sigma}^2\) = 1.447 (same as lmer)
  • Between Run mean square = \(\hat{\sigma}^2 + 3\:\hat{\sigma}^2_{\text{Run}}\), so \(\hat{\sigma}^2_{\text{Run}} = \frac{5.883 - \hat{\sigma}^2 }{3} = \frac{5.883 - 1.447}{3} = 1.479\)
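The back-calculation above is quick to check (values taken from the aov() table):

```r
## Recovering the between-Run variance component from the aov() mean squares
ms_run   <- 5.883  # Error: Run mean square
sigma2   <- 1.447  # within-Run (residual) mean square
n_perrun <- 3      # observations per Run
(ms_run - sigma2) / n_perrun  # estimate of sigma^2_Run, ~1.479 as from lmer
```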

Inference about the fixed effects

Specifying Run as a random effect changes our estimated baseline (i.e., the Intercept coefficient) as now an effect due to Run is attributed to the structural component of the model.

We can interpret the fixed effects of a LMM as we might for a linear model (now the Intercept estimate changes depending on Run).

coefficients(lmer4_mod)
## $Run
##   (Intercept) SurgeryP SurgeryS
## 1    7.639067    1.975     3.85
## 2    6.118411    1.975     3.85
## 3    7.948225    1.975     3.85
## 4    8.614297    1.975     3.85
## 
## attr(,"class")
## [1] "coef.mer"
coefficients(lmerTest_mod)
## $Run
##   (Intercept) SurgeryP SurgeryS
## 1    7.639067    1.975     3.85
## 2    6.118411    1.975     3.85
## 3    7.948225    1.975     3.85
## 4    8.614297    1.975     3.85
## 
## attr(,"class")
## [1] "coef.mer"

What about an ANOVA table? NOTE that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer:

anova(lmerTest_mod, type = 1)
## Type I Analysis of Variance Table with Satterthwaite's method
##         Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
## Surgery 29.652  14.826     2     6  10.246 0.01162 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lmerTest_mod, type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
##         Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
## Surgery 29.652  14.826     2     6  10.246 0.01162 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, as we only have a single fixed effect in our model (Surgery) the ANOVA Type I and Type II tables above are equivalent!

Inference using predictmeans() (Note: output will be the same for the lmerTest model.)

pm <- predictmeans(lmer4_mod, modelterm = "Surgery", pairwise = TRUE, plot = FALSE)

Using the elements of the predictmeans object (as in the last section) we can extract the pairwise comparison information:

pm$`Pairwise LSDs`
##         C        P      S
## C 0.00000 -1.97500 -3.850
## P 2.08132  0.00000 -1.875
## S 2.08132  2.08132  0.000
## attr(,"Significant level")
## [1] 0.05
## attr(,"Degree of freedom")
## [1] 6
## attr(,"Note")
## [1] "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"
print(pm$`Pairwise p-value`)
##        C       P       S
## C 1.0000 -2.3219 -4.5263
## P 0.0593  1.0000 -2.2043
## S 0.0040  0.0697  1.0000
## attr(,"Note")
## [1] "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"

Gives us 1) pairwise differences (upper diagonal) and the LSD values (lower diagonal), and 2) the pairwise comparison statistic (the \(t\)-statistic in this case) on the upper diagonal and the associated p-value’s on the lower diagonal.

So is there a pairwise difference in means? Let’s organise the information in a table, using the comparisons() function sourced from the gist earlier.

comparisons(pm)
##   Comparison Difference   SED   LSD    lwr    upr      t      p
## 1        C-P     -1.975 0.851 2.081 -4.056  0.106 -2.322 0.0593
## 2        C-S     -3.850 0.851 2.081 -5.931 -1.769 -4.526 0.0040
## 3        P-S     -1.875 0.851 2.081 -3.956  0.206 -2.204 0.0697

Have a look at the CIs and p-values. What do you conclude?
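Notice that each CI in the table is simply the estimated difference plus or minus the LSD, for example for the C-P comparison (values copied from the table above):

```r
## Fisher LSD interval: difference +/- LSD (C-P row of the table above)
diff_cp <- -1.975
lsd     <- 2.081
c(lwr = diff_cp - lsd, upr = diff_cp + lsd)  # matches the lwr/upr columns
```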

We could also plot the pairwise comparisons using emmeans

em <- emmeans(lmer4_mod, specs =  "Surgery")
plot(em, comparisons = TRUE) + theme_bw()

A Split-plot design

glimpse(split_plot)
## Rows: 12
## Columns: 5
## $ Disease <chr> "Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Health…
## $ Organ   <chr> "innerLV", "outerLV", "innerLV", "outerLV", "innerLV", "outerL…
## $ Animal  <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6
## $ Sample  <dbl> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2
## $ logAUC  <dbl> 9.40, 8.83, 10.33, 10.49, 9.74, 10.98, 7.92, 9.37, 8.69, 11.31…

To use predictmeans later on we have to ensure that the relevant variables are coded as factors:

split_plot <- split_plot %>%
  mutate(Animal = as.factor(Animal)) %>%
  mutate(Disease = as.factor(Disease)) %>%
  mutate(Organ = as.factor(Organ))

Animal as a random effect

Using aov()

sp_aov <- aov(logAUC ~ Disease*Organ + Error(Animal), data = split_plot)
summary(sp_aov)
## 
## Error: Animal
##           Df Sum Sq Mean Sq F value Pr(>F)
## Disease    1  3.183   3.183   2.187  0.213
## Residuals  4  5.822   1.456               
## 
## Error: Within
##               Df Sum Sq Mean Sq F value Pr(>F)  
## Organ          1  4.296   4.296  14.423 0.0191 *
## Disease:Organ  1  2.539   2.539   8.525 0.0433 *
## Residuals      4  1.191   0.298                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using lmer() (from lmerTest and lme4)

Recall that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer. As we now have specified an interaction model then the type of \(SS\) calculated will have an effect on inference for unbalanced designs.

sp_lmer <- lmerTest::lmer(logAUC ~ Disease*Organ + (1|Animal), 
                          data = split_plot) 
anova(sp_lmer,type = 1)
## Type I Analysis of Variance Table with Satterthwaite's method
##               Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
## Disease       0.6513  0.6513     1     4  2.1866 0.21329  
## Organ         4.2960  4.2960     1     4 14.4227 0.01914 *
## Disease:Organ 2.5392  2.5392     1     4  8.5246 0.04326 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sp_lmer,type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
##               Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
## Disease       0.6513  0.6513     1     4  2.1866 0.21329  
## Organ         4.2960  4.2960     1     4 14.4227 0.01914 *
## Disease:Organ 2.5392  2.5392     1     4  8.5246 0.04326 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note here, though, that our design is balanced, hence the ordering of terms in our model does not make a difference (no need to specify Type II \(SS\)).

Pairwise comparison of time means

When a design has blocking, to get summary stats using predictmeans you should fit the model using lme4::lmer():

Calling predictmeans alone produces plots:

lmer <- lme4::lmer(logAUC ~ Disease*Organ + (1|Animal), 
                          data = split_plot)
predmeans <- predictmeans::predictmeans(model = lmer ,modelterm = "Disease:Organ", 
                                        pairwise = TRUE)

predmeans$predictmeansPlot
## $meanPlot

## 
## $ciPlot

Recall from previous sections that the LSD value is essentially the buffer around the point estimate (the radius of the CI, if you like), beyond the limit of which we might believe there to be a significant difference (being a bit lax with terminology there!).

How would we interpret the above plot? I would conclude that there is a big difference between Diabetic and Healthy for the innerLV wall, but no difference for the outerLV wall. We can construct the pairwise comparison table:

comparisons(predmeans)
##                          Comparison Difference   SED   LSD    lwr    upr      t
## 1 Diabetic:innerLV-Diabetic:outerLV     -2.117 0.446 1.237 -3.354 -0.879 -4.750
## 2  Diabetic:innerLV-Healthy:innerLV     -1.950 0.764 2.123 -4.073  0.173 -2.551
## 3  Diabetic:innerLV-Healthy:outerLV     -2.227 0.764 2.123 -4.349 -0.104 -2.913
## 4  Diabetic:outerLV-Healthy:innerLV      0.167 0.765 2.123 -1.956  2.289  0.218
## 5  Diabetic:outerLV-Healthy:outerLV     -0.110 0.764 2.123 -2.233  2.013 -0.144
## 6   Healthy:innerLV-Healthy:outerLV     -0.277 0.446 1.237 -1.514  0.961 -0.621
##        p
## 1 0.0090
## 2 0.0464
## 3 0.0293
## 4 0.8352
## 5 0.8907
## 6 0.5683

What do the CI coverages indicate?

We can also use emmeans to plot the pairwise comparisons

em <- emmeans(lmer, ~Disease*Organ)
plot(em, comparisons = TRUE) + theme_bw()

From all the output above, what do you conclude about the interaction effect, if any?

A repeated measures design

liver <- liver %>%
  mutate(Time = as.factor(Time)) %>%
  mutate(Treatment = as.factor(Treatment)) 
glimpse(liver)
## Rows: 210
## Columns: 4
## $ Animal    <chr> "Control1", "Control1", "Control1", "Control1", "Control1", …
## $ Treatment <fct> Control, Control, Control, Control, Control, Control, Contro…
## $ Time      <fct> 0, 5, 10, 15, 20, 25, 30, 0, 5, 10, 15, 20, 25, 30, 0, 5, 10…
## $ Glucose   <dbl> 1.599, 1.279, 1.599, 1.579, 1.279, 1.439, 1.179, 0.840, 0.64…

Below these data are plotted.

Animal as a random effect

Using aov()

re_aov <- aov(Glucose ~ Treatment*Time + Error(Animal),data = liver)
summary(re_aov)
## 
## Error: Animal
##           Df Sum Sq Mean Sq F value Pr(>F)
## Treatment  3   1.90  0.6335   1.407  0.263
## Residuals 26  11.71  0.4503               
## 
## Error: Within
##                 Df Sum Sq Mean Sq F value       Pr(>F)    
## Time             6 0.7973 0.13289   8.732 0.0000000334 ***
## Treatment:Time  18 0.2539 0.01411   0.927        0.547    
## Residuals      156 2.3741 0.01522                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using lmer() (from lmerTest and lme4)

Recall that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer. As we now have specified an interaction model then the type of \(SS\) calculated will have an effect on inference for unbalanced designs.

re_lmer <- lmerTest::lmer(Glucose ~ Treatment*Time + (1|Animal), data = liver)
anova(re_lmer,type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
##                 Sum Sq  Mean Sq NumDF DenDF F value        Pr(>F)    
## Treatment      0.06423 0.021409     3    26  1.4068        0.2632    
## Time           0.79731 0.132885     6   156  8.7318 0.00000003345 ***
## Treatment:Time 0.25390 0.014105    18   156  0.9269        0.5474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise comparison of time means

As above, when a design has blocking, to get summary stats using predictmeans you should fit the model using lme4::lmer():

re_lmer4 <- lme4::lmer(Glucose ~ Treatment*Time + (1|Animal),data = liver) 
predmeans <- predictmeans(model = re_lmer4 ,modelterm = "Time",
                                        pairwise = TRUE, plot = TRUE)

TASK What inference do you draw? You may also find the outputs from the code below useful.

comparisons(predmeans)
em <- emmeans(re_lmer4, ~Treatment*Time)
plot(em, comparisons = TRUE) + theme_bw() + facet_wrap(~Treatment)

Module 4

“Ecology has increasingly become a data- and model-centric discipline…”
— Trends in ecology and conservation over eight decades

Figure 5: Data analysis, statistical methods, and genetics n-grams from 1930 to 2010

R packages and datasets

It is assumed in all following examples of this module that the following code has been executed successfully.

packages used in this module

library(tidyverse) ## general functionality
library(lme4) ## mixed effects models

datasets used in this module

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
  1. birds
birds <- read_delim(paste(base_url, "bird_abundance.csv", sep = "")) %>%
  dplyr::select(c("OliveFarm","Management","Turdus_merula","Phylloscopus_collybita")) %>%
  pivot_longer(., c(-OliveFarm, -Management), names_to = "Species", values_to = "Count")
  2. lobster
lobster <- read_csv(paste(base_url, "lobster.csv", sep = ""))
  3. mice
mice <- read_csv(paste(base_url, "autism.csv", sep = ""))

Introduction to generalised linear models (GLMs)

Recall, the simple linear regression model from Module 3:

\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where

\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]

Here for observation \(i\)

  • \(Y_i\) is the value of the response
  • \(x_i\) is the value of the explanatory variable
  • \(\epsilon_i\) is the error term: the difference between \(Y_i\) and its expected value
  • \(\alpha\) is the intercept term (a parameter to be estimated), and
  • \(\beta_1\) is the slope: coefficient of the explanatory variable (a parameter to be estimated)

We also saw a different specification of this model in module 3:

There is an alternative, equivalent way of specifying the linear regression model which attributes the randomness directly to the response variable rather than the error \(\epsilon_i\):

\[Y_i \sim \text{Normal}(\alpha + \beta_1 x_i, \sigma^2).\]

That is, we assume the \(i^{th}\) observation’s response, \(Y_i\), comes from a normal distribution with mean \(\mu_i = \alpha + \beta_1 x_i\) and variance \(\sigma^2\).

In this case we assume that

  • the \(i^{th}\) observation’s response, \(Y_i\), comes from a normal distribution,
  • the mean of \(Y_i\) is a linear combination of the explanatory terms,
  • the variance of \(Y_i\), \(\sigma^2\), is the same for all observations, and
  • that each observation’s response is independent of all others.

But, what if we want to be a little more flexible and move away from some of these assumptions? What if we want to rid ourselves of the assumption of normal errors? The answer: Generalised Linear Models (GLMs).
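To see the connection, note that a GLM with a normal (gaussian) response and identity link is exactly the linear regression model above. A minimal sketch with simulated data (the variable names here are illustrative, not from the course datasets):

```r
## A gaussian GLM with identity link reproduces lm() exactly
set.seed(738)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)
lm_fit  <- lm(y ~ x)
glm_fit <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(lm_fit), coef(glm_fit))  # TRUE: identical estimates
```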

Poisson regression

Poisson regression is commonly used for modelling counts. The Poisson is a discrete distribution (taking non-negative integer values only) and has \(\text{Var}(Y_i) = \mu_i\) (i.e., we expect the variance to increase with the mean).

If we were to assume (as previously) that \(\mu_i = \alpha + \beta_1 x_i\) then we would be allowing \(\mu < 0\), which is not supported by the Poisson distribution. So, we use a link function to map between \(\mu_i\) and the real number line:

\[\text{log}(\mu_i) = \alpha + \beta_1 x_i.\]

So, \(\mu_i > 0\) and \(-\infty < \text{log}(\mu_i) < \infty\); however negative the linear predictor \(\alpha + \beta_1 x_i\) gets, \(\mu_i\) will always be positive.

Equivalently, \[ \mu_i = \text{exp}(\alpha + \beta_1 x_i)\]

and

\[Y_i \sim \text{Poisson}(\mu_i)\]

Interpreting coefficients

Recall, for a linear regression model \(\mu = \alpha + \beta_1x\), when \(x=0\) \(y = \alpha\) and for every one-unit increase in \(x\), \(y\) increases by amount \(\beta_1\).

For a Poisson regression model we have \(\text{log}(\mu) = \alpha + \beta_1 x.\)

We can interpret this in the same way! That is, when \(x\) is zero, the log of the expected value of the response equals \(\alpha\) and for every one-unit increase in \(x\), the log of the expected value of the response increases by amount \(\beta_1\). But interpreting the effect of \(x\) on the log of the expected value is not straightforward.

Now, we have

\[\mu = \text{e}^{ \alpha + \beta_1 x} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{ x}.\] This implies that

  • when \(x = 0\) \(\mu = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{0} = \text{e}^{ \alpha}\times 1 = \text{e}^{ \alpha},\)
  • when \(x = 1\) \(\mu = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{1} = \text{e}^{ \alpha}\text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1},\)
  • when \(x = 2\) \(\mu = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{2} = \text{e}^{ \alpha}\text{e}^{\beta_1} \text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1 + \beta_1},\)
  • when \(x = 3\) \(\mu = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{3} = \text{e}^{ \alpha}\text{e}^{\beta_1} \text{e}^{\beta_1}\text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1 + \beta_1 + \beta_1},\) and
  • so on …

Therefore, for every n-unit increase in x, the expected value of the response is multiplied by \(\text{e}^{n\beta_1}.\)
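This multiplicative interpretation is easy to verify numerically (the coefficient values below are made up for illustration):

```r
## On the response scale a one-unit increase in x multiplies the expected
## count by exp(beta1); an n-unit increase multiplies it by exp(n * beta1)
alpha <- 1.2; beta1 <- 0.3           # hypothetical coefficients
mu <- function(x) exp(alpha + beta1 * x)
mu(1) / mu(0)                        # equals exp(beta1)
mu(5) / mu(2)                        # equals exp(3 * beta1): a 3-unit increase
```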

Goodness-of-fit

Typically, we use Poisson regression to

  1. ensure the expected value is \(>0\),
  2. account for non-constant variance, and
  3. assume a discrete distribution for a discrete response.

But, how do we assess if our choice was appropriate? Use the deviance!

For a fitted Poisson regression the deviance, \(D\), is

\[D = 2 \sum^{n}_{i=1} \{ Y_{i} \log(Y_{i}/\mu_{i}) - (Y_{i}-\mu_{i}) \}\]

where if \(Y_i=0\), the \(Y_{i} \log(Y_{i}/\mu_{i})\) term is taken to be zero, and \(\mu_{i} = \exp(\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + ... + \hat{\beta}_{p} X_{p})\) is the predicted mean for observation \(i\) based on the estimated model parameters.

The deviance is a measure of how well the model fits the data: if the model fits well, the observed values \(Y_{i}\) will be close to their predicted means \(\mu_{i}\), making both of the terms in \(D\) small, and so the deviance small. The flip side is that a large deviance indicates a poorly fitting model.

Formally, we can test the null hypothesis that the model is correct by calculating a p-value using \[ p = \Pr(\chi^2_{n - k} > D).\]
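Putting the deviance formula and the test together, here is a minimal sketch on simulated data (all names are illustrative; this is not one of the course datasets):

```r
## Computing the Poisson deviance by hand, then the chi-squared GOF p-value
set.seed(738)
x   <- runif(100, 0, 2)
y   <- rpois(100, lambda = exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = "poisson")
mu  <- fitted(fit)
term     <- ifelse(y == 0, 0, y * log(y / mu))  # y*log(y/mu), taken as 0 when y = 0
D_manual <- 2 * sum(term - (y - mu))
all.equal(D_manual, deviance(fit))              # matches glm()'s residual deviance
p_val <- pchisq(D_manual, df.residual(fit), lower.tail = FALSE)
p_val                                           # small p-value indicates lack of fit
```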

Conditions of the chi-squared approximation

The distribution of the deviance under the null hypothesis is approximately chi-squared if the response of each observation is well approximated by a normal distribution. This holds for Poisson random variables with \(\mu_i > 5\).

However, if our chi-squared approximation assumptions are not met we should find another way.

An example: bird abundance

A recent publication Partitioning beta diversity to untangle mechanisms underlying the assembly of bird communities in Mediterranean olive groves investigates bird abundance data for a number of olive farms. Each farm is catalogued according to the level of landscape complexity (high; intermediate; low) and the type of management of the ground cover (extensive or intensive). These data are available in the birds dataset:

glimpse(birds)
## Rows: 80
## Columns: 4
## $ OliveFarm  <chr> "MORALEDAEXP", "MORALEDAEXP", "MORALEDACON", "MORALEDACON",…
## $ Management <chr> "intensive", "intensive", "extensive", "extensive", "intens…
## $ Species    <chr> "Turdus_merula", "Phylloscopus_collybita", "Turdus_merula",…
## $ Count      <dbl> 21, 14, 29, 21, 5, 31, 12, 52, 103, 10, 77, 21, 26, 9, 17, …

To fit a Poisson model we specify family = "poisson" in a call to glm(). Note that the default link function for family = "poisson" is the log link; we could also use the equivalent syntax poisson(link = log) to specify this model. Or, we could change the link function to something else (e.g., poisson(link = identity)) if that makes sense.

glm_bird <- glm(Count ~ Species, data = birds, family = "poisson")
summary(glm_bird)
## 
## Call:
## glm(formula = Count ~ Species, family = "poisson", data = birds)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           3.06105    0.03422   89.45   <2e-16 ***
## SpeciesTurdus_merula  0.94537    0.04032   23.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2234.6  on 79  degrees of freedom
## Residual deviance: 1621.9  on 78  degrees of freedom
## AIC: 2030.1
## 
## Number of Fisher Scoring iterations: 5

The fitted model is therefore

\[ \log ({ \widehat{E( \operatorname{Count} )} }) = 3.061 + 0.945(\operatorname{Species}_{\operatorname{Turdus\_merula}}) \]

Interpreting the coefficients

Interpreting the coefficients above we estimate that the log of the expected number of Phylloscopus collybita is 3.06 and the log of the expected number of Turdus merula is 4.01.

But what about the expected number?

exp(coef(glm_bird))
##          (Intercept) SpeciesTurdus_merula 
##             21.35000              2.57377

Therefore, we estimate that the expected average number of Phylloscopus collybita is 21.35 and the expected average number of Turdus merula is 54.95.
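A quick check of this arithmetic (coefficient values copied from the summary output above):

```r
## Back-transforming the Poisson coefficients to the response (count) scale
alpha <- 3.06105; beta1 <- 0.94537  # values from summary(glm_bird)
exp(alpha)               # expected count, Phylloscopus collybita (~21.35)
exp(alpha + beta1)       # expected count, Turdus merula (~54.95)
exp(alpha) * exp(beta1)  # equivalently: baseline times multiplicative effect
```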

Using confidence intervals:

confint <- exp(confint(glm_bird)[1,])
confint
##    2.5 %   97.5 % 
## 19.94989 22.81408

Therefore, we estimate that the expected average number of Phylloscopus collybita is between 19.95 and 22.81.

For a multiplicative interpretation of the effect use \(\text{exp}(\beta_1)\)

exp(coef(glm_bird)[2])
## SpeciesTurdus_merula 
##              2.57377

Therefore, we estimate that the expected number of Turdus merula is 2.57 \(\times\) greater than Phylloscopus collybita.

For a percentage-change interpretation use \(100 \times \left (\text{exp}(\beta_1)−1 \right )\):

100*(exp(coef(glm_bird)[2]) - 1)
## SpeciesTurdus_merula 
##              157.377

Therefore, we estimate that the expected number of Turdus merula is 157.38% greater than Phylloscopus collybita.

Using 95% confidence intervals:

confint <- 100*(exp(confint(glm_bird)[2, ]) - 1)
confint
##    2.5 %   97.5 % 
## 137.9259 178.6742

So, we estimate that the expected number of Turdus merula is between 137.93% and 178.67% greater than Phylloscopus collybita.

Logistic regression

We define a random variable, \(Y_i\), to have a binomial distribution if it is the number of successes from a number of independent trials, \(n_i\), each with the same probability of success, \(p_i\). It is a discrete distribution: the number of successes associated with the \(i^{th}\) observation must be an integer between \(0\) and \(n_i\). In addition, it builds in the non-constant variance of \(Y_i\) and \(\frac{Y_i}{n_i}\): \(\text{Var}(Y_i)=n_ip_i(1−p_i)\) and \(\text{Var}\left(\frac{Y_i}{n_i}\right) = \frac{p_i(1−p_i)}{n_i}\).

If we were to assume a linear relationship (as previously) that \(p_i = \alpha + \beta_1 x_i\) then we would be allowing \(p < 0\) and \(p>1\), which is not supported by the binomial distribution. So, we use a link function to map between \(p\) and the real number line:

\[\text{logit}(p_i) = \text{log}\left (\frac{p_i}{1 - p_i}\right ) = \alpha + \beta_1x_i.\] This leads to \[p_i = \frac{\exp(\alpha + \beta_1x_i)}{1 + \exp(\alpha + \beta_1x_i)}.\] and

\[Y_i \sim \text{Binomial}(n_i, p_i)\]
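In R, the logit and inverse-logit transformations above are available directly as qlogis() and plogis(); a quick numeric check that they match the formulas:

```r
## logit(p) = log(p / (1 - p)) is qlogis(); its inverse is plogis()
p <- 0.8
qlogis(p)          ## log(0.8 / 0.2)
log(p / (1 - p))   ## the same, by hand
plogis(qlogis(p))  ## back-transforms to 0.8
## the inverse logit of a linear predictor eta is exp(eta) / (1 + exp(eta))
eta <- -2 + 0.5 * 3
plogis(eta)
exp(eta) / (1 + exp(eta))
```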

Interpreting coefficients

Recap, for probability \(p\) the odds are \(\frac{p}{1-p}\) and the log-odds are \(\text{log}\left (\frac{p}{1-p}\right )\):

  • when \(p=0.5\) the odds \(=1\) and the log-odds \(= 0\)
  • when \(p=1\) the odds \(= \infty\) and the log-odds \(= \infty\)
  • when \(p=0\) the odds \(=0\) and the log-odds \(= -\infty\)

Recall, for a linear regression model \(\mu = \alpha + \beta_1x\), when \(x=0\) \(y = \alpha\) and for every one-unit increase in \(x\), \(y\) increases by amount \(\beta_1\).

For a logistic regression model we have \[\begin{array}{rl} \text{logit}(p) = \alpha + \beta_1 x \\ \text{log}\left (\frac{p}{1-p}\right ) = \alpha + \beta_1 x \\ \text{log}\left (\text{odds}\right ) = \alpha + \beta_1 x \\ \end{array}\]

We can interpret this as when \(x = 0\), the log-odds of success are equal to \(\alpha\) and that for every one-unit increase in \(x\) the log-odds of success increase by \(\beta_1\). But, interpreting the effect of \(x\) on the log-odds of success is not straightforward.

Now, we have

\[\text{odds} = \text{e}^{ \alpha + \beta_1 x} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{ x}.\] This implies that

  • when \(x = 0\) \(\text{odds} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{0} = \text{e}^{ \alpha}\times 1 = \text{e}^{ \alpha},\)
  • when \(x = 1\) \(\text{odds} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{1} = \text{e}^{ \alpha}\text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1},\)
  • when \(x = 2\) \(\text{odds} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{2} = \text{e}^{ \alpha}\text{e}^{\beta_1} \text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1 + \beta_1},\)
  • when \(x = 3\) \(\text{odds} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{3} = \text{e}^{ \alpha}\text{e}^{\beta_1} \text{e}^{\beta_1}\text{e}^{\beta_1} = \text{e}^{ \alpha + \beta_1 + \beta_1 + \beta_1},\) and
  • so on …
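A quick numeric check of this pattern, with arbitrary illustrative values of \(\alpha\) and \(\beta_1\): every one-unit increase in \(x\) multiplies the odds by \(\text{e}^{\beta_1}\), regardless of the starting value of \(x\).

```r
## arbitrary illustrative coefficients
alpha <- -1
beta1 <- 0.7
odds <- function(x) exp(alpha + beta1 * x)
odds(1) / odds(0) ## equals exp(beta1)
odds(3) / odds(2) ## the same ratio, whatever the starting x
```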

Goodness-of-fit

Typically, we use logistic regression when we have a binary or proportional response (e.g., success vs. failure). But how do we assess whether our choice was appropriate? One option is to use the deviance.

Formally, we can test the null hypothesis that the model is correct by calculating a p-value using \[ p = \Pr(\chi^2_{n - k} > D).\]

Conditions of the chi-squared approximation

The distribution of the deviance under the null hypothesis is approximately chi-squared if the response of each observation is well approximated by a normal distribution. This holds for binomial random variables if the number of trials, \(n_i\), is large enough:

  • when \(p_i\) is close to 0.5, \(n_i \geq 5\) is probably sufficient,
  • but if \(p_i\) is close to 0 or 1, \(n_i\) must be much larger.

However, if the assumptions of the chi-squared approximation are not met we should use an alternative approach (e.g., a simulation-based test).

An example: lobsters

Let us, again, consider data from the published article Influence of predator identity on the strength of predator avoidance responses in lobsters.

The authors were interested in how a juvenile lobster’s size was related to its vulnerability to predation. In total, 159 juvenile lobsters were collected from their natural habitat in the Gulf of Maine, USA, and the length of each lobster’s carapace (upper shell) was measured to the nearest 3 mm (recorded as size). The lobsters were then tethered to the ocean floor for 24 hours. Any missing lobsters were assumed to have been consumed by a predator, while the surviving lobsters were released (i.e., survived = 1 if the lobster survived, 0 otherwise). These data are available in the lobster dataset:

glimpse(lobster)
## Rows: 159
## Columns: 2
## $ size     <dbl> 42, 36, 51, 33, 33, 45, 54, 48, 39, 48, 36, 42, 45, 39, 42, 3…
## $ survived <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1…

Ungrouped model

Fitting a binomial model, we specify family = "binomial" in our glm call. Note that the default link function for family = "binomial" is the logit link; we could also use the equivalent syntax binomial(link = "logit") to specify this model.

glm_mod_ug <- glm(survived ~ size, family = "binomial", data = lobster)
summary(glm_mod_ug)
## 
## Call:
## glm(formula = survived ~ size, family = "binomial", data = lobster)
## 
## Coefficients:
##             Estimate Std. Error z value      Pr(>|z|)    
## (Intercept) -7.89597    1.38501  -5.701 0.00000001191 ***
## size         0.19586    0.03415   5.735 0.00000000977 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 220.41  on 158  degrees of freedom
## Residual deviance: 172.87  on 157  degrees of freedom
## AIC: 176.87
## 
## Number of Fisher Scoring iterations: 4

The fitted model is therefore

\[ \log\left[ \frac { \widehat{P( \operatorname{survived} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{survived} = \operatorname{1} )} } \right] = -7.896 + 0.196(\operatorname{size}) \]
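Given the fitted equation, the estimated survival probability at any carapace size is the inverse logit of the linear predictor. Below we plug in the reported coefficients; plogis() is base R's inverse-logit function.

```r
## estimated survival probability at a few carapace sizes, using the
## coefficients reported in the summary above
alpha <- -7.89597
beta <- 0.19586
size <- c(30, 40, 50)
plogis(alpha + beta * size) ## survival probability increases with size
```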

Grouped model

The data are currently ungrouped, despite many lobsters sharing the same carapace size. Therefore, we rearrange the data set so that it is grouped:

grouped <- lobster %>%
  group_by(size) %>%
  summarise(y = sum(survived), n = length(survived), p = mean(survived))
grouped
## # A tibble: 11 × 4
##     size     y     n     p
##    <dbl> <dbl> <int> <dbl>
##  1    27     0     5 0    
##  2    30     1    10 0.1  
##  3    33     3    22 0.136
##  4    36     7    21 0.333
##  5    39    12    22 0.545
##  6    42    17    29 0.586
##  7    45    13    18 0.722
##  8    48    12    17 0.706
##  9    51     7     8 0.875
## 10    54     6     6 1    
## 11    57     1     1 1

Where,

  • size is as above,
  • y is the number of lobsters of each size that survived,
  • n is the total number of lobsters of each size, and
  • p is the proportion of lobsters of each size that survived.

Fitting a binomial model again we specify family = "binomial" in our glm call and specify our response as cbind(y, n - y):

glm_mod_gr <- glm(cbind(y, n - y) ~ size, family = "binomial", data = grouped)
summary(glm_mod_gr)
## 
## Call:
## glm(formula = cbind(y, n - y) ~ size, family = "binomial", data = grouped)
## 
## Coefficients:
##             Estimate Std. Error z value      Pr(>|z|)    
## (Intercept) -7.89597    1.38501  -5.701 0.00000001191 ***
## size         0.19586    0.03415   5.735 0.00000000977 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52.1054  on 10  degrees of freedom
## Residual deviance:  4.5623  on  9  degrees of freedom
## AIC: 32.24
## 
## Number of Fisher Scoring iterations: 4

The fitted model is again

\[ \log\left[ \frac { \widehat{P( \operatorname{survived} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{survived} = \operatorname{1} )} } \right] = -7.896 + 0.196(\operatorname{size}) \]

Interpreting the coefficients

Interpreting the coefficients above, we estimate that the log-odds of a juvenile lobster surviving when size = 0 are -7.90 (use your common sense to ascertain if interpreting the intercept is sensible).

We estimate that for every 1mm increase in carapace length the log-odds of a juvenile lobster surviving increase by 0.196.

What about the odds?

exp(coef(glm_mod_gr))
##  (Intercept)         size 
## 0.0003722407 1.2163540760

Therefore, we estimate that the odds of a juvenile lobster surviving when size = 0 are 0.000372 (use your common sense to ascertain if interpreting the intercept is sensible).

We estimate that for every 1mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by 1.22.

What about for a 5mm increase in carapace length?

exp(5*coef(glm_mod_gr)[2])
##     size 
## 2.662564

Therefore, we estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by 2.66.

Using 95% confidence intervals:

ci <- exp(5*confint(glm_mod_gr)[2,])
ci
##    2.5 %   97.5 % 
## 1.944478 3.811167

Therefore, we estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by between 1.94 and 3.81.

For a percentage-change interpretation we use \(100 \times \left (\text{exp}(x\beta_1)−1 \right )\):

100*(exp(5*coef(glm_mod_gr)[2]) - 1)
##     size 
## 166.2564
confint <- 100*(exp(5*confint(glm_mod_gr)[2, ]) - 1)
confint
##     2.5 %    97.5 % 
##  94.44778 281.11670

We estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving increase by 166.26%, or, using the 95% confidence interval, by between 94.45% and 281.12%.

Model diagnostics

Residuals

Recall from module 2, where we looked at residual plots to check our assumptions for a linear model. In a similar way we can use residuals to check how appropriate our GLM is (i.e., to help diagnose the overall goodness-of-fit and to see if our model assumptions are met). Often, Pearson and deviance residuals are used in model diagnostics for generalized linear models.

Response residuals are the conventional residuals on the response scale: the observed values minus the fitted values, where the fitted values are obtained by applying the inverse of the link function to the linear predictor. Think back to linear models, where the link function was the identity.

Deviance residuals represent the contributions of individual samples to the deviance, D (see above).

Pearson residuals are calculated by dividing the raw residuals (i.e., observed - fitted) by the square root of the variance function evaluated at the fitted value.

Luckily, we can calculate all three directly in R. Below we do that for the bird abundance model.

## response residuals
resids_response <- residuals(glm_bird, type = "response")
## deviance residuals
resids_deviance <- residuals(glm_bird, type = "deviance")
## pearson residuals
resids_pearson <- residuals(glm_bird, type = "pearson")

Plotting all three types, we can assess how appropriate our chosen model is.

So, what do you see? Look back at the assumptions of a Poisson model. What do you conclude from the plots above?

The quantile residual Q-Q plot

Recall from Module 2 that a Normal quantile-quantile (Q-Q) plot is used to check the overall similarity of the observed distribution of the residuals to that expected under the model (i.e., Gaussian). An alternative to a Normal Q-Q plot for a GLM fit is a quantile residual Q-Q plot of observed versus expected quantile residuals. Quantile residuals are typically the residuals of choice for GLMs, as the deviance and Pearson residuals can be grossly non-normal. Indeed, quantile residuals are arguably the only useful residuals for binomial or Poisson data when the response takes on only a small number of distinct values. We can use the statmod::qresiduals() function to compute these residuals.
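To make the construction concrete, here is a sketch of the idea behind quantile residuals for a Poisson fit on simulated data (in practice you would simply call statmod::qresiduals() on your fitted model): for a discrete response, we draw uniformly within the CDF step at each observation and map the result to normal quantiles, so a well-specified model yields approximately standard normal residuals.

```r
## sketch of the quantile-residual construction for a Poisson GLM
## (simulated data; use statmod::qresiduals() in practice)
set.seed(42)
d <- data.frame(x = runif(200))
d$y <- rpois(200, lambda = exp(1 + 0.5 * d$x))
fit <- glm(y ~ x, family = "poisson", data = d)
mu <- fitted(fit)
## randomise within the discrete CDF step at each observed count ...
u <- runif(200, min = ppois(d$y - 1, mu), max = ppois(d$y, mu))
## ... then map to normal quantiles
qres <- qnorm(u)
qqnorm(qres); qqline(qres) ## should lie close to the line here
```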

Deviance (using the chi-squared approach)

For our Poisson model

## extract the residual deviance
D <- glm_bird$deviance
D
## [1] 1621.948
## extract the residual degrees of freedom (n-k)
df <- glm_bird$df.residual
df
## [1] 78

Therefore, to test the relevant null hypothesis (that the model is correct) we use

1 - pchisq(D, df)
## [1] 0

We have strong evidence to reject the null hypothesis; suggesting a lack of fit! BUT are our chi-squared approximation assumptions met? If not, we might take a simulation based approach. This is, however, beyond the scope of this course.

For our binomial model

## extract the residual deviance
D <- glm_mod_gr$deviance
D
## [1] 4.562321
## extract the residual degrees of freedom (n-k)
df <- glm_mod_gr$df.residual
df
## [1] 9

Therefore, to test the relevant null hypothesis (that the model is correct) we use

1 - pchisq(D, df)
## [1] 0.8706732

Here, we have no evidence against our model being “correct”. BUT are our chi-squared approximation assumptions met? If not, we might take a simulation based approach. This is, however, beyond the scope of this course.

A summary of GLMs

The three distributions we’ve covered above are:

  1. Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i\)
  2. Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i\)
  3. Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i\)

What would happen if we wanted to add extra explanatory terms (e.g., \(z_i\))? Then,

  1. Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i + \beta_2 z_i\)
  2. Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i+ \beta_2 z_i\)
  3. Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i+ \beta_2 z_i\)

What about interactions (e.g., \(x_iz_i\))?

  1. Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i + \beta_2 z_i + \beta_3 x_iz_i\)
  2. Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i+ \beta_2 z_i + \beta_3 x_iz_i\)
  3. Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i+ \beta_2 z_i + \beta_3 x_iz_i\)

Building a GLM

Assume the observations are independent of one another, then,

  1. Choose a distribution for the response. For example, Normal, Poisson, or Binomial.

  2. Choose a parameter to relate to explanatory terms. For example, \(\mu_i\), \(\mu_i\), or \(p_i\).

  3. Choose a link function. For example, identity, log, or logit.

  4. Choose explanatory terms

  5. Estimate additional parameters. For example, \(\sigma^2\).
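These five choices map directly onto a glm() call. As a hypothetical example (simulated data, all values illustrative): a positive continuous response suggests a Gamma distribution (choice 1), we relate its mean to the explanatory term x (choices 2 and 4) through a log link (choice 3), and the dispersion is the additional parameter estimated (choice 5).

```r
## simulated positive continuous response with mean exp(1 + 0.8 x)
set.seed(7)
d <- data.frame(x = runif(100, 0, 2))
d$y <- rgamma(100, shape = 2, rate = 2 / exp(1 + 0.8 * d$x))
fit <- glm(y ~ x, family = Gamma(link = "log"), data = d)
coef(fit) ## estimates of alpha and beta_1 (on the log scale)
summary(fit)$dispersion ## the additional (dispersion) parameter
```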

Other distributions

We are not restricted to the three distributions above. Many others exist:

  • Gamma and inverse-Gaussian, for continuous responses on the interval \([0,\infty)\)
  • Beta, for continuous responses on the interval \([0,1]\)
  • Quasi-Poisson\(^{+}\), typically for modelling overdispersed count data (i.e., when \(\text{Var}(Y) > E(Y)\)) where \(\text{Var}(Y)\) is a linear function of \(E(Y)\)
  • Negative binomial, for discrete responses on \(\{0, 1, 2, \cdots\}\) (can also be used for overdispersed count data), with \(\text{Var}(Y) \geq E(Y)\) and where \(\text{Var}(Y)\) is a quadratic function of \(E(Y)\)
  • Quasi-binomial\(^{+}\), a model for situation when \(\text{Var}(Y) > \text{Var}(Y_{Y \sim \text{Binomial}})\)
  • Beta-binomial, a Binomial-type model where \(p \sim \text{Beta}(a, b)\)
Distribution Notation Mean \(\mu = E(Y)\) Variance \(\sigma^2 = \text{Var}(Y)\) Linear predictor (w. a typical link function)
Gaussian \(Y\sim\textbf{Normal}(\mu, \sigma^2)\) \(\mu\) \(\sigma^2\) I\((\mu) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Poisson \(Y \sim \textbf{Poisson}(\mu)\) where \(\mu = \text{rate}\) \(\mu\) \(\mu\) log\((\mu) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Binomial \(Y \sim \textbf{Binomial}(\text{n}, p)\) where \(\text{n} = \text{number of trials}\) and \(p = \text{probability of success}\) \(\text{n}p\) \(\text{n}p (1-p)\) logit\((p) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Gamma \(Y \sim \textbf{Gamma}(k, \theta = \frac{1}{\text{rate}})\) where \(k = \text{shape}\) and \(\theta = \text{scale}\) \(k\theta\) \(k\theta^2\) log\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Beta \(Y \sim \textbf{Beta}(a, b)\) where \(a\) and \(b\) are \(\text{shape}\) parameters \(\frac{a}{a+b}\) \(\frac{ab}{(a+b)^2(a+b+1)}\) logit\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Negative binomial \(Y \sim \textbf{NB}(r, p)\) where \(\text{r} = \text{number of successes until the experiment is stopped}\) and \(p = \text{probability of success}\) or \(Y \sim \textbf{NB}(k, p)\) where \(\text{k} = \text{number of failures}\) given \(p = \text{probability of success}\) \(\frac{r(1-p)}{p}\) or \(\mu = k\frac{p}{1-p}\) \(\frac{r(1-p)}{p^2}\) or \(\mu + \frac{\mu^2}{k}\) log\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)
Beta-binomial \(Y \sim \textbf{BetaBin}(\text{n},a, b)\) where \(\text{n} = \text{number of trials}\) and \(p = \frac{a}{a + b}, \text{the probability of success}\) \(\frac{\text{n} a}{a+b}= \text{n}p\) \(\frac{\text{n} a b(a + b + \text{n})}{(a+b)^2(a + b + 1)}\) logit\((p) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\)

\(^{+}\) Note that the Quasi-… distributions are not really proper distributions; they are simply models describing a mean-variance relationship beyond that offered by the base distribution.
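As a sketch of when the quasi-Poisson model earns its keep, we can simulate overdispersed counts (negative binomial here, so \(\text{Var}(Y) > E(Y)\)) and fit family = quasipoisson; the estimated dispersion parameter should come out well above 1 (all values below are illustrative).

```r
## overdispersed counts: negative binomial with Var(Y) = mu + mu^2 / 2
set.seed(11)
d <- data.frame(x = runif(100))
d$y <- rnbinom(100, mu = exp(1 + d$x), size = 2)
fit_q <- glm(y ~ x, family = quasipoisson, data = d)
summary(fit_q)$dispersion ## well above 1, flagging overdispersion
```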

TASK Study this cheatsheet and find relevant examples of response variables from your field of study for each distribution.

Note glm() is only the beginning.

R function Use
gam() Fit a generalised additive model. The R package mgcv must be loaded
lme() & nlme() Fit linear and non-linear mixed effects models, respectively. The R package nlme must be loaded
gls() Fit generalised least squares models. The R package nlme must be loaded

Generalised linear mixed-effects models (GLMMs)

Recall module 3 when we covered fitting linear models with random effects (lmer). The fixed effects and random effects were specified via the model formula. Now we’ve covered GLMs we can include random effects here too and fit GLMMs!

Recall

  • Fixed effects: terms (parameters) in a statistical model which are fixed, or non-random, quantities (e.g., treatment group’s mean response). For the same treatment, we expect this quantity to be the same from experiment to experiment.
  • Random effects: terms (parameters) in a statistical model which are considered as random quantities or variables (e.g., block id). Specifically, terms whose levels are a representative sample from a population, and where the variance of the population is of interest should be allocated as random. Setting a block as a random effect allows us to infer variation between blocks as well as the variation between experimental units within blocks.

Fitting a GLMM

The authors of this paper transplanted gut microbiota from human donors with Autism Spectrum Disorder (ASD) or typically developing (TD) controls into germ-free (GF) mice. Faecal samples were collected from three TD and five ASD donors and were used to colonise GF male and female mice from strain C57BL/6LJ. Individuals colonized by the same donor were allowed to breed. Adult offspring mice were behaviour tested; tests included marble burying (MB), open-field testing (OFT), and ultrasonic vocalization (USV).

glimpse(mice)
## Rows: 206
## Columns: 3
## $ Donor     <chr> "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "A24-n…
## $ Treatment <chr> "NT Female", "NT Female", "NT Female", "NT Female", "NT Male…
## $ MB_buried <dbl> 3, 8, 1, 3, 2, 6, 4, 2, 9, 7, 5, 5, 11, 13, 6, 14, 9, 5, 6, …

We should separate out the treatment values:

mice <- mice %>%
  separate(., col = Treatment, into = c("Diagnosis", "Sex"))
mice
## # A tibble: 206 × 4
##    Donor   Diagnosis Sex    MB_buried
##    <chr>   <chr>     <chr>      <dbl>
##  1 C1      NT        Female         3
##  2 C1      NT        Female         8
##  3 C1      NT        Female         1
##  4 C1      NT        Female         3
##  5 C1      NT        Male           2
##  6 C1      NT        Male           6
##  7 C1      NT        Male           4
##  8 C1      NT        Male           2
##  9 C1      NT        Female         9
## 10 A24-new ASD       Male           7
## # ℹ 196 more rows

Recall, using lmer to fit a LMM

lmer <- lmerTest::lmer(MB_buried ~ Sex * Diagnosis + (1|Donor), data = mice)
car::Anova(lmer, type = 2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: MB_buried
##                Chisq Df Pr(>Chisq)   
## Sex           3.6329  1   0.056648 . 
## Diagnosis     8.6944  1   0.003192 **
## Sex:Diagnosis 2.9745  1   0.084588 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But is the constant error variance appropriate? Below we plot the model residuals.

TASK Looking at the output above, do you think the assumptions of the linear model are met?

Using glmer

glmer <- lme4::glmer(MB_buried ~ Sex * Diagnosis + (1|Donor), data = mice, family = poisson(link = "log"))
car::Anova(glmer, type = 2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: MB_buried
##                 Chisq Df Pr(>Chisq)    
## Sex            9.9452  1  0.0016127 ** 
## Diagnosis     10.5836  1  0.0011410 ** 
## Sex:Diagnosis 12.5904  1  0.0003877 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TASK Looking at the output above, do you think the assumptions of the Poisson model are met?

Module 5

R packages and datasets

It is assumed in all following examples of this module that the following code has been executed successfully.

packages used in this module

library(tidyverse) ## general functionality
library(palmerpenguins) ## penguins data
library(GGally) ## nice looking pairs plots
library(factoextra) ## multivariate analysis
library(vegan) ## multivariate analysis
library(ape) ## dendograms
library(MASS) ## linear discriminant analysis
library(ggfortify) ## plotting
library(pheatmap) ## plotting
library(ade4) ## plotting

datasets used in this module

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
  1. palmerpenguins::penguins
data(penguins, package = "palmerpenguins")
  2. ants
ants <- read_csv(paste(base_url, "pitfalls.csv", sep = ""))
  3. north_island
north_island <- read_csv(paste(base_url, "north_islands_distances.csv", sep = "")) %>% 
  column_to_rownames(var = "...1")
  4. ekman
ekman <- read_csv(paste(base_url, "ekman.csv", sep = ""))
  5. eurodist
data("eurodist", package = "datasets")
  6. HairEyeColor
data("HairEyeColor", package = "datasets")
  7. diabetes
diabetes <- read_csv(paste(base_url, "diabetes.csv", sep = ""))

Clustering

“Clusters may be described as continuous regions of (a) space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.”
— Everitt, 1980
“Cluster analysis has the apparently simple aim of finding clusters in a data cloud of sampling units in the absence of any a priori information about which point belongs in which cluster. This apparently unambitious aim is unfortunately fraught with problems.”
— Brian McArdle, STATS302

In brief, cluster analysis involves using measures of (dis)similarity and distances to help us define clusters. We use this to uncover hidden or latent clustering by partitioning the data into tighter sets. There are two main methods for doing this: 1) divisive methods, which use nonparametric algorithms (such as k-means) to split data into a small number of clusters, and 2) agglomerative methods, which cluster cases and/or variables into a hierarchy of sets (e.g., hierarchical clustering). We can use resampling-based bootstrap methods to validate clusters.

Divisive (partitioning) methods.

For a single run of a partitioning method, the number of clusters ( \(k\) ) is typically fixed beforehand. Typically, there are only two steps to a partitioning method:

  1. an initial allocation (usually rather arbitrary) into \(k\) preliminary clusters, and then
  2. reallocation of each point either to the closest centroid, or so as to optimise some property of the clusters. This is repeated until there is no further improvement.

The initial allocation is usually started by choosing \(k\) sampling units to use as seeds to set the clusters. There are a number of ways to choose these seeds. These seeds are used as the initial centres of the clusters, points are allocated to the nearest cluster centre, and in most programs the cluster centroid is adjusted as they are added.

K-means

K-means clustering involves defining clusters so that the overall variation within a cluster (known as the total within-cluster variation) is minimized. How do we define this variation? Typically, using Euclidean distances; the total within-cluster variation is, in this case, defined as the sum of squared Euclidean distances between observations and the corresponding cluster centroid.

In summary, the k-means procedure is

  • The number of clusters (k) are specified
  • k objects from the dataset are selected at random and set as the initial cluster centers or means
  • Each observation is assigned to their closest centroid (based on the Euclidean distance between the object and the centroid)
  • For each of the k clusters the cluster centroid is then updated based on calculating the new mean values of all the data points in the cluster
  • Repeat the two previous steps until 1) the cluster assignments stop changing or 2) the maximum number of iterations is reached
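The procedure above can be sketched in a few lines of base R. This toy implementation on well-separated simulated clusters is for intuition only; in practice use stats::kmeans().

```r
## toy k-means for intuition only (use stats::kmeans() in practice)
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2)) ## two separated groups of 20
k <- 2
centers <- x[sample(nrow(x), k), , drop = FALSE] ## k random rows as seeds
for (iter in 1:20) {
  ## assign each point to its closest centroid (squared Euclidean distance)
  d2 <- sapply(1:k, function(j) colSums((t(x) - centers[j, ])^2))
  cl <- max.col(-d2)
  ## update each centroid to the mean of its assigned points
  new_centers <- t(sapply(1:k, function(j) colMeans(x[cl == j, , drop = FALSE])))
  if (all(new_centers == centers)) break ## assignments stable: converged
  centers <- new_centers
}
table(cl) ## recovers the two groups of 20
```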

Identifying the appropriate \(k\) is important because too many or too few clusters impedes viewing overall trends. Too many clusters can lead to over-fitting (which limits generalizations) while insufficient clusters limits insights into commonality of groups.

There are assorted methodologies to identify the appropriate \(k\). Tests range from blunt visual inspections to robust algorithms. The optimal number of clusters is ultimately a subjective decision.

K-means: an example using the palmerpenguins data

## getting rid of all NAs 
penguins <- penguins %>% drop_na()
## introducing a new package GGally, please install
## using install.packages("GGally")
library(GGally)
penguins %>%
  dplyr::select(species, where(is.numeric)) %>% 
  ggpairs(columns = c("flipper_length_mm", "body_mass_g", 
                     "bill_length_mm", "bill_depth_mm")) 

We see that a lot of these variables (e.g., flipper_length_mm, body_mass_g, and bill_length_mm) are relatively strongly (positively) related to one another. Could they actually be telling us the same information? Combined, we could think of these three variables as all telling us a little about the bigness of a penguin. Is there a way we could reduce these three variables into, say, one that represents the bigness of a penguin? We may not need all the information (variation) captured by these variables, but could get away with fewer new uncorrelated variables that represent basically the same information (e.g., penguin bigness), thereby reducing the dimensionality of the data (more on this later).

## create a data frame of what we're interested in
df <- penguins %>%
  dplyr::select(where(is.numeric), -year)

We use the kmeans() function.

The first argument of kmeans() should be the dataset you wish to cluster. Below we use data frame df, the penguin data discussed above. But how many clusters do we choose? Let’s try 1 to 5… (i.e., using the centers argument). Setting nstart = 25 means that R will try 25 different random starting assignments and then select the best results corresponding to the one with the lowest within cluster variation.

## set the seed so we all start off in the same place
set.seed(4321)
## one cluster
k1 <- kmeans(df, centers = 1, nstart = 25)
## two clusters
k2 <- kmeans(df, centers = 2, nstart = 25)
## three clusters
k3 <- kmeans(df, centers = 3, nstart = 25)
## four clusters
k4 <- kmeans(df, centers = 4, nstart = 25)
## five clusters
k5 <- kmeans(df, centers = 5, nstart = 25)

The kmeans() function returns a list of components:

  • cluster, integers indicating the cluster to which each observation is allocated
  • centers, a matrix of cluster centers/means
  • totss, the total sum of squares
  • withinss, within-cluster sum of squares, one component per cluster
  • tot.withinss, total within-cluster sum of squares
  • betweenss, between-cluster sum of squares
  • size, number of observations in each cluster
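These components can be pulled straight off the returned object; a self-contained toy example (simulated data rather than the penguin df, so the numbers are illustrative):

```r
## toy data: two separated groups of 15 points
set.seed(1)
toy <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
             matrix(rnorm(30, mean = 6), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 25)
km$size ## observations per cluster
km$tot.withinss ## total within-cluster sum of squares
km$betweenss ## between-cluster sum of squares
all.equal(km$totss, km$tot.withinss + km$betweenss) ## the sums of squares add up
```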

Choosing the number of clusters

We have an idea there may be 3 clusters, perhaps, but how do we know this is the best fit? Remember it’s a subjective choice; below we look at a few pointers.

Visual inspection method

library(factoextra) ## a new package for kmeans viz, please install
p1 <- fviz_cluster(k1, data = df)
p2 <- fviz_cluster(k2, data = df)
p3 <- fviz_cluster(k3, data = df)
p4 <- fviz_cluster(k4, data = df)
p5 <- fviz_cluster(k5, data = df)

## for arranging plots
library(patchwork) 
(p1| p2| p3)/ (p4 | p5)

Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.

df %>%
  mutate(cluster = k3$cluster,
         species = penguins$species) %>%
  ggplot(aes(flipper_length_mm, bill_depth_mm, color = factor(cluster), label = species)) +
  geom_text()

Elbow method

Optimal clusters are at the point at which the knee “bends” or, in mathematical terms, when the marginal reduction in total within sum of squares (tot.withinss) from adding another cluster begins to decrease at a linear rate.

This is easier to see via a plot:

fviz_nbclust(df, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

There is a pretty obvious inflection (elbow) at 2 clusters, but maybe at 3 too. We can rule out an optimal number of clusters above 3 as there is then only a minimal marginal reduction in total within sum of squares. However, the model is ambiguous on whether 2 or 3 clusters is optimal…

Silhouette method

# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

Gap method

# Gap statistic
# recommended value: nboot = 500 for your analysis (it will take a while)
set.seed(123) ## for reproducibility
fviz_nbclust(df, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

Basically, it’s up to you to collate all the suggestions and make an informed decision.

## Trying all the cluster indices AHHHHH
library(NbClust)
cluster_30_indexes <- NbClust(data = df, distance = "euclidean", min.nc = 2, max.nc = 9, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 4 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
fviz_nbclust(cluster_30_indexes) +
      theme_minimal() +
      labs(title = "Frequency of Optimal Clusters using 30 indexes in NbClust Package")
## Error in if (class(best_nc) == "numeric") print(best_nc) else if (class(best_nc) == : the condition has length > 1

Not obvious: basically still undecided between 2 and 3, but according to the majority rule the “best” number is 3.

Hierarchical agglomerative clustering.

Most of the hierarchical methods are agglomerative and they operate in the same way:

  1. All sampling units that are zero distance (however this is defined) apart are fused into clusters.
  2. The threshold for fusion is then raised from zero until two clusters (they may be individual points) are found that are close enough to fuse.
  3. The threshold is raised, fusing the clusters as their distance apart is reached until all the clusters have been fused into one big one.

So, the close clusters are fused first, then those further apart, until all have been fused. This process allows the history of the fusions, the hierarchy, to be displayed as a dendrogram. This is an advantage of the agglomerative methods: if the data have a nested structure, these techniques provide a useful way of displaying it.

Unlike k-means, most of the agglomerative techniques can use a broad range of similarity or distance measures. This, however, means that considerable care must be taken to choose an appropriate one; different measures often lead to different results.

Single linkage (nearest neighbour) clustering.

Single Linkage (nearest neighbour/minimal jump): Computes the distance between clusters as the smallest distance between any two points in the two clusters.

Single Linkage identifies clusters based on how far apart they are at their closest points. This means that if there are any intermediate points then single linkage will fuse the groups without leaving any trace of their separate identities. This is called chaining and it often leads to uninformative dendrograms.

If, however, the clusters are well separated in the data, then single linkage can handle groups of different shapes and sizes easily. In addition, single linkage will give the same clustering after any monotonic transformation of the distance measure (i.e., it is fairly robust to the choice of measure).
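A quick base-R sketch of this invariance, using simulated (hypothetical) data: the fusion hierarchy produced by single linkage is unchanged by a monotone transformation of the distances (here a square root), although the fusion heights differ.

```r
set.seed(42)
x <- matrix(rnorm(20), ncol = 2)          ## 10 hypothetical observations
d <- dist(x)                              ## Euclidean distances
h1 <- hclust(d, method = "single")
h2 <- hclust(sqrt(d), method = "single")  ## monotone transformation of d
identical(h1$merge, h2$merge)             ## same fusion hierarchy
```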

Complete linkage (farthest neighbour) clustering.

Instead of measuring the distance between two clusters as that between their two nearest members, complete linkage (maximum jump) uses that between the two farthest members (i.e., it calculates the maximum distance between two points from each cluster).

The resulting clusters are often compact, spherical, and well defined. Complete linkage can, however, be sensitive to tied distances, although it, too, is robust to a certain amount of measurement error and to the choice of distance measure.

Group average linkage (UPGMA)

Group average linkage (UPGMA) is probably the most popular hierarchical clustering method! You might like to think of it as an attempt to avoid the extremes of the single and complete linkage methods as the distance between two clusters is the average of the distances between the members of the two groups. As a result this method tends to produce compact spherical clusters.

Ward’s method (incremental sums of squares, minimum variance, agglomerative sums of squares).

Ward’s method is the hierarchical version of the k-means partitioning method. At each fusion it attempts to minimise the increase in the total sum of squared distances within the clusters. This is equivalent to minimising the sum of squared within-cluster deviations from the centroids.

Ward’s method, at any one stage, can only fuse those clusters already in existence (i.e., it is not allowed to reallocate points). A bad start to the agglomeration process can place the algorithm on a path from which it can never reach the global optimum for a given number of clusters. Its chief flaw is a tendency to form clusters of equal size, regardless of the true number. Like the complete linkage and group average methods it is also biased towards forming spherical clusters. Despite this, Ward’s method performs well in simulations and is often the method of choice.

Hierarchical clustering: an example

In summary

  • Start with a matrix of distances, (or similarities) between pairs of observations (cases)
  • Choice of distance measure key first step
  • Algorithm:
    • Initial \(n\) singleton clusters
    • Scan the distance matrix for the two closest individuals and group them together
    • Compute the distances from this cluster of size 2 to each of the remaining \(n-2\) singleton clusters
    • Repeat, fusing the two closest clusters at each step, until all observations are in a single cluster
Method           | Pros                      | Cons
-----------------|---------------------------|-----------------------------------
Single linkage   | number of clusters        | comb-like trees
Complete linkage | compact clusters          | one obs. can alter groups
Average linkage  | similar size and variance | not robust
Centroid         | robust to outliers        | smaller number of clusters
Ward             | minimising an inertia     | clusters small if high variability
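The summary above can be sketched in base R on some hypothetical simulated data; hclust() and cutree() (from the stats package) are the workhorses:

```r
set.seed(4)
x <- matrix(rnorm(16), ncol = 2)     ## 8 hypothetical observations
d <- dist(x)                         ## step 1: pairwise distance matrix
h <- hclust(d, method = "average")   ## agglomerative clustering
h$merge                              ## the fusion history (the hierarchy)
cutree(h, k = 3)                     ## cut the dendrogram into 3 clusters
```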

Ants

Data were collected on the distribution of ant species at 30 sites across the Auckland region using pitfall traps. Twenty pitfall traps at each site were left open for ten days and the number of individuals captured counted for the four most abundant species: Nylanderia spp, Pheidole rugosula, Tetramorium grassii, and Pachycondyla sp.

glimpse(ants)
## Rows: 30
## Columns: 8
## $ Location <chr> "West", "West", "West", "West", "West", "West", "West", "West…
## $ Habitat  <chr> "Forest", "Grass", "Urban", "Forest", "Grass", "Forest", "Gra…
## $ Month    <dbl> 1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1…
## $ Site     <chr> "WF1", "WG1", "WU1", "WF2", "WG2", "WF3", "WG3", "WU3", "CF1"…
## $ Nyl      <dbl> 0, 0, 3, 0, 5, 0, 0, 0, 0, 0, 1, 0, 0, 22, 15, 0, 0, 10, 2, 0…
## $ Phe      <dbl> 0, 2, 7, 0, 0, 0, 3, 1, 0, 3, 0, 0, 7, 109, 1, 0, 13, 47, 0, …
## $ Tet      <dbl> 0, 7, 0, 0, 25, 0, 2, 0, 0, 22, 5, 0, 30, 54, 35, 0, 14, 7, 4…
## $ Pac      <dbl> 157, 37, 0, 31, 0, 21, 1, 0, 1, 2, 1, 3, 1, 0, 4, 1, 2, 0, 0,…

Data are species counts, so we will use the Bray–Curtis dissimilarity measure:

pitfall.dist <- vegan::vegdist(ants[,5:8], method = "bray", binary = FALSE)
factoextra::fviz_dist(pitfall.dist)

Single-linkage

single <- ants[,5:8] %>%
  vegan::vegdist(., method = "bray", binary = FALSE) %>%
  hclust(method = "single")
plot(single, labels = ants$Site)

Maximum linkage

complete <- ants[,5:8] %>%
  vegan::vegdist(., method = "bray", binary = FALSE) %>%
  hclust(method = "complete")
plot(complete, labels = ants$Site)

Average linkage (UPGMA)

average <- ants[,5:8] %>%
  vegan::vegdist(., method = "bray", binary = FALSE) %>%
  hclust(method = "average")
plot(average, labels = ants$Site)

Ward’s method

ward <- ants[,5:8] %>%
  vegan::vegdist(., method = "bray", binary = FALSE) %>%
  hclust(method = "ward.D")
plot(ward, labels = ants$Site)

WHAT ARE DENDROGRAMS GOOD FOR? Suggesting clusters for further study…

Using the function cutree() to split into clusters and plot:

ants$clust4 <- cutree(ward, k = 4)
library(ape)   ## install this package first
pitfall.phylo <- as.phylo(ward)
pitfall.phylo$tip.label <- ants$Site
## Set colours 
colours  <-  c("red","blue","green","black")
plot(pitfall.phylo, cex = 0.6, tip.color = colours[ants$clust4], label.offset = 0.05) 

Principal Component Analysis (PCA)

Dimension reduction is needed when there are far too many features in a dataset, making it hard to distinguish between the important ones, which are relevant to the output, and the redundant or not-so-important ones. Reducing the number of dimensions of the data is called dimensionality reduction.

So the aim is to find the best low-dimensional representation of the variation in a multivariate (lots and lots of variables) data set, but how do we do this?

PCA is a member of a family of techniques for dimension reduction (ordination).

The word ordination was applied to dimension reduction techniques by botanical ecologists whose aim was to identify gradients in species composition in the field. For this reason they wanted to reduce the quadrat × species (observations × variables) data matrix to a single ordering (hence ordination) of the quadrats which they hoped would reflect the underlying ecological gradient.

One way is termed Principal Component Analysis (PCA). PCA is a feature extraction method that reduces the dimensionality of the data (number of variables) by creating new uncorrelated variables while minimizing loss of information on the original variables. It is a technique for the analysis of an unstructured sample of multivariate data. Its aim is to display the relative positions of the observations in the data cloud in fewer dimensions (while losing as little information as possible) and to help give insight into the way the observations vary. It is not a hypothesis testing technique (like t-test or Analysis of Variance); it is an exploratory, hypothesis generating tool that describes patterns of variation, and suggests relationships that should be investigated further.

Think of a baguette. The baguette pictured here represents two data dimensions: 1) the length of the bread and 2) the height of the bread (we’ll ignore depth of bread for now). Think of the baguette as your data; when we carry out PCA we’re rotating our original axes (x- and y-coordinates) to capture as much of the variation in our data as possible. This results in new uncorrelated variables that each explain a % of variation in our data; the procedure is designed so that the first new variable (PC1) explains the most, the second (PC2) the second most and so on.

Now rather than a baguette think of data; the baguette above represents the shape of the scatter between the two variables plotted below. The rotating grey axes represent the PCA procedure, essentially searching for the best rotation of the original axes to represent the variation in the data as best it can. Mathematically the Euclidean distance (e.g., the distance between points \(p\) and \(q\) in Euclidean space, \(\sqrt{(p-q)^2}\)) between the points and the rotating axes is being minimized (i.e., the shortest possible across all points), see the blue lines. Once this distance is minimized across all points we “settle” on our new axes (the black tiled axes).

Luckily we can do this all in R!
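A minimal sketch of this rotation idea, assuming some simulated two-variable (hypothetical) data: prcomp() returns the new rotated axes, and the rotation matrix is orthonormal, so it only rotates/reflects the data cloud, never distorts it.

```r
set.seed(3)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.3)  ## two correlated variables
pca2 <- prcomp(cbind(x, y))
pca2$rotation                  ## the new (rotated) axes
crossprod(pca2$rotation)       ## identity matrix: a pure rotation/reflection
```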

One problem with PCA is that the components are not scale invariant. That means if we change the units in which our variables are expressed, we change the components; and not in any simple way either. So, every scaling or adjustment of the variables in preparation for the analysis could (and usually does) produce a separate component structure. It is therefore important to choose a standardisation or transformation carefully: PCA will give different results depending on whether we analyse the covariance matrix, where the data have merely been centred (corrected for the column, variable, mean), or the correlation matrix, where the data have been standardised to z-scores (centred and converted into standard deviation units). This is particularly important, as many computer programs for PCA automatically analyse the correlation matrix. If you do not want that standardisation, you may have to explicitly ask for the covariance matrix. As you would expect, the results from the two analyses will usually be very different, see below.
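A small illustration of the difference, on hypothetical simulated data where one variable has a much larger variance than the other (prcomp() analyses the covariance matrix by default; scale. = TRUE switches to the correlation matrix):

```r
set.seed(1)
df <- data.frame(a = rnorm(100, sd = 1), b = rnorm(100, sd = 100))
cov_pca <- prcomp(df)                 ## covariance matrix (centred only)
cor_pca <- prcomp(df, scale. = TRUE)  ## correlation matrix (z-scores)
cov_pca$rotation  ## PC1 is dominated by the high-variance variable b
cor_pca$rotation  ## loadings balanced once the variables are standardised
```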

Examples in R

The palmerpenguins data

When carrying out PCA we’re only interested in numeric variables, so let’s just plot those. We can use the piping operator %>% to do this without creating a new data frame.

library(GGally)
penguins %>%
  dplyr::select(species, where(is.numeric)) %>% 
  ggpairs(aes(color = species),
        columns = c("flipper_length_mm", "body_mass_g", 
                     "bill_length_mm", "bill_depth_mm")) 

Using prcomp()

There are three basic types of information we obtain from Principal Component Analysis:

  • PC scores: the coordinates of our samples on the new PC axis: the new uncorrelated variables (stored in pca$x)

  • Eigenvalues: (see above) represent the variance explained by each PC; we can use these to calculate the proportion of variance in the original data that each axis explains

  • Variable loadings (eigenvectors): these reflect the weight that each variable has on a particular PC and can be thought of as the correlation between the PC and the original variable

Before we carry out PCA we should scale our data. WHY?

pca <- penguins %>% 
  dplyr::select(where(is.numeric), -year) %>% ## year makes no sense here so we remove it and keep the other numeric variables
  scale() %>% ## scale the variables
  prcomp()
## print out a summary
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.6569 0.8821 0.60716 0.32846
## Proportion of Variance 0.6863 0.1945 0.09216 0.02697
## Cumulative Proportion  0.6863 0.8809 0.97303 1.00000

This output tells us that we obtain 4 principal components, called PC1, PC2, PC3, and PC4 (as expected, because we used the 4 original numeric variables!). Each of these PCs explains a percentage of the total variation (Proportion of Variance) in the dataset:

  • PC1 explains \(\sim\) 68% of the total variance, which means that just over two-thirds of the information in the dataset (4 variables) can be encapsulated by just that one principal component.
  • PC2 explains \(\sim\) 19% of the variance.
  • PC3 explains \(\sim\) 9% of the variance.
  • PC4 explains \(\sim\) 2% of the variance.

From the Cumulative Proportion row we see that by knowing the position of a sample in relation to just PC1 and PC2 we can get a pretty accurate view on where it stands in relation to other samples, as just PC1 and PC2 explain 88% of the variance.

The loadings (relationship) between the initial variables and the principal components are stored in pca$rotation:

pca$rotation
##                          PC1         PC2        PC3        PC4
## bill_length_mm     0.4537532 -0.60019490 -0.6424951  0.1451695
## bill_depth_mm     -0.3990472 -0.79616951  0.4258004 -0.1599044
## flipper_length_mm  0.5768250 -0.00578817  0.2360952 -0.7819837
## body_mass_g        0.5496747 -0.07646366  0.5917374  0.5846861

Here we can see that bill_length_mm has a strong positive relationship with PC1, whereas bill_depth_mm has a strong negative relationship. Both flipper_length_mm and body_mass_g also have a strong positive relationship with PC1.

Plotting this we get

The new variables (PCs) are stored in pca$x; let’s plot some of them alongside the loadings using a biplot. For PC1 vs PC2:

library(factoextra) ## install this package first
fviz_pca_biplot(pca, geom = "point") +
      geom_point (alpha = 0.2)

Now for PC2 vs PC3

fviz_pca_biplot(pca, axes = c(2, 3), geom = "point") +
      geom_point (alpha = 0.2)

But how many PCs (new variables) do we keep? The whole point of this exercise is to reduce the number of variables we need to explain the variation in our data. So how many of these new variables (PCs) do we keep?

To assess this we can use the information printed above alongside a screeplot:

fviz_screeplot(pca)

Principal components from the original variables

Recall that the principal components are linear combinations of the (standardised) variables. So for PC1:

loadings1 <- pca$rotation[,1]
loadings1
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##         0.4537532        -0.3990472         0.5768250         0.5496747

Therefore, the first principal component is \(0.454\times Z_1 -0.399 \times Z_2 + 0.5768 \times Z_3 + 0.5497 \times Z_4\), where \(Z_1\), \(Z_2\), \(Z_3\), and \(Z_4\) are the scaled numerical variables from the penguins dataset (i.e., bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g). To compute this we use R:

scaled_vars <- penguins %>% 
  dplyr::select(where(is.numeric), -year) %>% 
  scale() %>%
  as_tibble()
## By "Hand"
by_hand <- loadings1[1]*scaled_vars$"bill_length_mm" + 
  loadings1[2]*scaled_vars$"bill_depth_mm" + 
  loadings1[3]*scaled_vars$"flipper_length_mm" + 
  loadings1[4]*scaled_vars$"body_mass_g"
## From PCA
pc1 <- pca$x[,1]
plot(by_hand,pc1)

TASK Simulate some white noise using the code below. Now investigate any potential correlations in the data and carry out PCA. What do you conclude?

set.seed(1234) ## just for reproducibility
noise <- as_tibble(replicate(10,rnorm(200, mean = 50, sd = 10)))

Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is an extended family of techniques that try to reproduce the relative positions of a set of points in a reduced space given, not the points themselves, but only a matrix of interpoint distances (dissimilarities). These distances might be measured with error, or even be non-Euclidean.

Metric Scaling

Metric scaling tries to produce a set of coordinates (a configuration of points) in a reduced number of dimensions whose matrix of interpoint Euclidean distances approximates the original dissimilarity matrix as closely as possible. Principal coordinates (PCO) is one metric scaling technique (it is sometimes called classical or Torgerson scaling).
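As a sketch of what classical scaling does, cmdscale() can recover a (hypothetical, simulated) point configuration from its Euclidean distance matrix alone, up to rotation and reflection:

```r
set.seed(2)
pts <- matrix(rnorm(30), ncol = 2)  ## a hypothetical 2-D configuration
rec <- cmdscale(dist(pts), k = 2)   ## reconstruct from the distances alone
## the interpoint distances are preserved; only the orientation is lost
max(abs(dist(pts) - dist(rec)))     ## effectively zero
```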

Examples in R

Consider data which are not represented as points in a feature space:

  • Where we are only provided with (dis)similarity matrices between objects (e.g., chemical compounds, images, trees, or other complex objects)
  • Where there are no obvious coordinates in (continuous) \(n\)-dimensional space.

Distances (in km) between North Island cities

TASK On a piece of paper roughly draw the shape of the North Island (of Aotearoa | New Zealand). From memory alone mark the location of the following cities Auckland, Gisborne, Hamilton, Hastings, Napier, Rotorua, Tauranga, Whanganui, Wellington, Whakatane, and Whangarei.

glimpse(north_island) 
## Rows: 11
## Columns: 11
## $ Auckland   <dbl> 0, 505, 126, 441, 421, 235, 205, 457, 658, 298, 165
## $ Gisborne   <dbl> 505, 0, 394, 235, 215, 286, 298, 467, 538, 201, 664
## $ Hamilton   <dbl> 126, 394, 0, 315, 295, 109, 106, 331, 532, 193, 294
## $ Hastings   <dbl> 441, 235, 315, 0, 20, 240, 319, 235, 302, 328, 606
## $ Napier     <dbl> 421, 215, 295, 20, 0, 220, 299, 250, 319, 308, 585
## $ Rotorua    <dbl> 235, 286, 109, 240, 220, 0, 86, 301, 449, 86, 401
## $ Tauranga   <dbl> 205, 298, 106, 319, 299, 86, 0, 439, 546, 97, 370
## $ Whanganui  <dbl> 457, 467, 331, 235, 250, 301, 439, 0, 188, 388, 617
## $ Wellington <dbl> 658, 538, 532, 302, 319, 449, 546, 188, 0, 535, 806
## $ Whakatane  <dbl> 298, 201, 193, 328, 308, 86, 97, 388, 535, 0, 483
## $ Whangarei  <dbl> 165, 664, 294, 606, 585, 401, 370, 617, 806, 483, 0
pheatmap(north_island, cluster_rows = TRUE,
         treeheight_row = 2, treeheight_col = 2,
         fontsize_row = 12, fontsize_col = 12,
         cellwidth = 26, cellheight = 26)

mds <- cmdscale(north_island, eig = TRUE)
mds
## $points
##                  [,1]       [,2]
## Auckland    259.23245   67.43013
## Gisborne   -107.54173 -285.70950
## Hamilton    129.07943   42.71295
## Hastings   -173.12950  -25.15974
## Napier     -150.83765  -34.70680
## Rotorua      37.39858  -18.39760
## Tauranga    118.78535  -85.88683
## Whanganui  -192.73988  181.50600
## Wellington -385.83172  167.76477
## Whakatane    49.93256 -140.17112
## Whangarei   415.65212  130.61774
## 
## $eig
##  [1]  5.249373e+05  1.953521e+05  4.217767e+04  1.872276e+04  1.222717e+03
##  [6]  2.910383e-11 -1.399691e+02 -4.733140e+02 -1.103819e+04 -1.883151e+04
## [11] -2.462990e+04
## 
## $x
## NULL
## 
## $ac
## [1] 0
## 
## $GOF
## [1] 0.8600209 0.9206005

Ekman’s colour perception (1954)

Data may come from objects for which we have similarities but no underlying (geometric) space. Here the goal is to understand the underlying dimensionality of colour perception.

  • Similarities for 14 colours, with wavelengths from 434 to 674nm based on rating by 31 subjects
  • Each pair of colours was rated on a 5-point scale:
    • 0 = no similarity up to 4 = identical.
  • After averaging over 31 raters the similarities were divided by 4 such that they are within the unit interval.

glimpse(ekman)
## Rows: 14
## Columns: 14
## $ w4        <dbl> 0.00, 0.86, 0.42, 0.42, 0.18, 0.06, 0.07, 0.04, 0.02, 0.07, …
## $ `34.w4`   <dbl> 0.86, 0.00, 0.50, 0.44, 0.22, 0.09, 0.07, 0.07, 0.02, 0.04, …
## $ `45.w4`   <dbl> 0.42, 0.50, 0.00, 0.81, 0.47, 0.17, 0.10, 0.08, 0.02, 0.01, …
## $ `65.w4`   <dbl> 0.42, 0.44, 0.81, 0.00, 0.54, 0.25, 0.10, 0.09, 0.02, 0.01, …
## $ `72.w4`   <dbl> 0.18, 0.22, 0.47, 0.54, 0.00, 0.61, 0.31, 0.26, 0.07, 0.02, …
## $ `90.w5`   <dbl> 0.06, 0.09, 0.17, 0.25, 0.61, 0.00, 0.62, 0.45, 0.14, 0.08, …
## $ `04.w5`   <dbl> 0.07, 0.07, 0.10, 0.10, 0.31, 0.62, 0.00, 0.73, 0.22, 0.14, …
## $ `37.w5`   <dbl> 0.04, 0.07, 0.08, 0.09, 0.26, 0.45, 0.73, 0.00, 0.33, 0.19, …
## $ `55.w5`   <dbl> 0.02, 0.02, 0.02, 0.02, 0.07, 0.14, 0.22, 0.33, 0.00, 0.58, …
## $ `84.w600` <dbl> 0.07, 0.04, 0.01, 0.01, 0.02, 0.08, 0.14, 0.19, 0.58, 0.00, …
## $ w610      <dbl> 0.09, 0.07, 0.02, 0.00, 0.02, 0.02, 0.05, 0.04, 0.37, 0.74, …
## $ w628      <dbl> 0.12, 0.11, 0.01, 0.01, 0.01, 0.02, 0.02, 0.03, 0.27, 0.50, …
## $ w651.w    <dbl> 0.13, 0.13, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02, 0.20, 0.41, …
## $ `674`     <dbl> 0.16, 0.14, 0.03, 0.04, 0.00, 0.01, 0.00, 0.02, 0.23, 0.28, …
ekman.mds <- cmdscale(ekman, eig = TRUE)
ekman.mds
round(ekman.mds$eig,2)
autoplot(ekman.mds)

Distances (in km) between 21 cities in Europe

TASK On a piece of paper roughly draw the shape of Europe. From memory alone mark the location of the following cities Athens, Barcelona, Brussels, Calais, Cherbourg, Cologne, Copenhagen, Geneva, Gibraltar, Hamburg, Hook of Holland, Lisbon, Lyons, Madrid, Marseilles, Milan, Munich, Paris, Rome, Stockholm, and Vienna.

## Plotting Multidimensional Scaling (for interest)
## stats::cmdscale performs Classical MDS
autoplot(eurodist)
## Plotting Classical (Metric) Multidimensional Scaling
autoplot(cmdscale(eurodist, eig = TRUE))
autoplot(cmdscale(eurodist, eig = TRUE), label = TRUE, shape = FALSE,
         label.size = 3)
## Plotting Non-metric Multidimensional Scaling
## MASS::isoMDS and MASS::sammon perform Non-metric MDS
autoplot(sammon(eurodist))
autoplot(sammon(eurodist), shape = FALSE, label = TRUE,label.size = 3)
## Have a go at interpreting these plots based on the geography of the cities :-)

Correspondence Analysis (CA)

CA is a special case of metric MDS where the distance measure is the chi-square distance. It is conceptually similar to principal component analysis, but for categorical data (counts) rather than continuous data. CA is traditionally applied to contingency tables, where rows and columns are treated equivalently; it decomposes the chi-square statistic associated with the table into orthogonal factors. Correspondence analysis is usually the best way to follow up on a significant chi-square test.

glimpse(HairEyeColor)
##  'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
##   ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
##   ..$ Sex : chr [1:2] "Male" "Female"
HC.df <- as.data.frame.matrix(HairEyeColor[ , , 2])
HC.df
##       Brown Blue Hazel Green
## Black    36    9     5     2
## Brown    66   34    29    14
## Red      16    7     7     7
## Blond     4   64     5     8
chisq.test(HC.df)
## 
##  Pearson's Chi-squared test
## 
## data:  HC.df
## X-squared = 106.66, df = 9, p-value < 2.2e-16
coaHC <- dudi.coa(HC.df, scannf = FALSE, nf = 2)

The first axis shows a contrast between black-haired and blond-haired students, mirrored by the brown-eye versus blue-eye contrast. In CA the two categories, rows and columns, play symmetric roles, and we interpret the proximity of Blue eyes and Blond hair as showing strong co-occurrence of these categories.
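For the curious, a base-R sketch of the decomposition that dudi.coa() performs: CA amounts to an SVD of the standardized Pearson residuals of the table, and the total inertia times the sample size reproduces the chi-square statistic above.

```r
tab <- as.matrix(as.data.frame.matrix(HairEyeColor[ , , 2]))
P <- tab / sum(tab)                 ## table of proportions
E <- outer(rowSums(P), colSums(P))  ## expected under independence
S <- (P - E) / sqrt(E)              ## standardized Pearson residuals
sv <- svd(S)                        ## the CA axes come from this SVD
sum(sv$d^2) * sum(tab)              ## reproduces X-squared = 106.66
```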

Biplot barycentric scaling

  • Row points at the centre of gravity of the column levels with their respective weights
  • Blue eyes at centre of gravity of the (Black, Brown, Red, Blond) with weights proportional to (9,34,7,64), the hair counts for blue eyes.
  • The Blond row point is very heavily weighted, so Blond hair and Blue eyes end up close together

Non-metric Multidimensional Scaling

Multidimensional scaling aims to minimize the difference between the squared distances \(D^2\) from the distance matrix \(D\), and the squared distances between the points with their new coordinates. Unfortunately, this objective tends to be sensitive to outliers: one single data point with large distances to everyone else can dominate, and thus skew, the whole analysis.

Under certain circumstances trying to preserve the actual dissimilarities might be too restrictive or even pointless. For example, if there is large error in the dissimilarity estimates, or if the dissimilarities (or the data they were based on) are ranks (ordinal), then the magnitudes of the distances are too crude to be worth preserving. A method that preserves only the rank order of the dissimilarities would be more appropriate. The algorithm to do this is virtually the same as the one given above for metric scaling; the sole difference is that the linear regression fitting the estimated distances of the solution to the dissimilarities is replaced with an order-preserving regression.

Robust ordination, or non-metric multidimensional scaling (NMDS), attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix. NMDS looks for a transformation \(f(\cdot)\) of the given dissimilarities (distances) \(d\). The quality of the approximation can be measured by the standardized residual sum of squares (STRESS) function: \(\text{Stress}^2 = \frac{\Sigma(f(d) - \tilde{d})^2}{\Sigma d^2}\), where \(f(d)\approx \tilde{d}\).

NMDS is not sequential:

  • we have to specify the underlying dimensionality \(k\) at the outset (like k-means)
  • optimization is run to maximize the reconstruction of the distances in \(k\) dimensions
  • there is no notion of the percentage of variation explained by individual axes, as provided in PCA
  • as for k-means, make a screeplot for \(k = 1, 2, 3, \ldots\) and look at how quickly the STRESS drops
  • because each calculation of a NMDS result requires a new optimization that is both random and dependent on the value of \(k\), we repeat the process M times
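A minimal NMDS sketch using MASS::isoMDS on the built-in eurodist distances (an alternative to the vegan approach used in the examples below; note that isoMDS reports STRESS as a percentage):

```r
library(MASS)  ## isoMDS performs non-metric MDS
nmds <- isoMDS(eurodist, k = 2, trace = FALSE)
nmds$stress       ## STRESS for the k = 2 solution (a percentage)
head(nmds$points) ## the embedded 2-D configuration of the 21 cities
```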

Examples in R

Use the function metaMDS from the vegan package; metaMDS performs NMDS, and tries to find a stable solution using several random starts. In addition, it standardizes the scaling in the result, so that the configurations are easier to interpret.

Illustration with k = 2

nMDS.2 <- replicate(100, metaMDS(ekman, k = 2, autotransform = FALSE)) 
stressplot(metaMDS(ekman, k = 2, autotransform = FALSE), pch = 20, cex = 2)
## Run 0 stress 0.2705898 
## Run 1 stress 0.2711545 
## Run 2 stress 0.2739054 
## Run 3 stress 0.2634663 
## ... New best solution
## ... Procrustes: rmse 0.2448256  max resid 0.4517853 
## Run 4 stress 0.265124 
## Run 5 stress 0.261966 
## ... New best solution
## ... Procrustes: rmse 0.1819642  max resid 0.5616079 
## Run 6 stress 0.2752863 
## Run 7 stress 0.2634591 
## Run 8 stress 0.2664506 
## Run 9 stress 0.2658195 
## Run 10 stress 0.2655221 
## Run 11 stress 0.2683486 
## Run 12 stress 0.2682856 
## Run 13 stress 0.272819 
## Run 14 stress 0.2611536 
## ... New best solution
## ... Procrustes: rmse 0.1314703  max resid 0.2913875 
## Run 15 stress 0.278313 
## Run 16 stress 0.3000136 
## Run 17 stress 0.2702804 
## Run 18 stress 0.2770341 
## Run 19 stress 0.2713099 
## Run 20 stress 0.2635924 
## *** Best solution was not repeated -- monoMDS stopping criteria:
##      1: no. of iterations >= maxit
##     18: stress ratio > sratmax
##      1: scale factor of the gradient < sfgrmin

Linear Discriminant Analysis (LDA)

LDA is a supervised learning technique: the main goal is to predict some feature of interest using one or more variables (the predictors).
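As a minimal sketch (using the built-in iris data as a hypothetical stand-in for the diabetes example below), MASS::lda() fits the model and predict() returns the predicted groups:

```r
library(MASS)
## predict Species (the feature of interest) from two measurements
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
predict(fit)$class[1:5]                   ## predicted groups
mean(predict(fit)$class != iris$Species)  ## training misclassification rate
```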

Example in R

glimpse(diabetes)
## Rows: 144
## Columns: 7
## $ id      <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,…
## $ relwt   <dbl> 0.81, 0.94, 1.00, 0.91, 0.99, 0.90, 0.96, 0.74, 1.10, 0.83, 0.…
## $ glufast <dbl> 80, 105, 90, 100, 97, 91, 78, 86, 90, 85, 90, 95, 92, 98, 86, …
## $ glutest <dbl> 356, 319, 323, 350, 379, 353, 290, 312, 364, 296, 378, 347, 38…
## $ steady  <dbl> 124, 143, 240, 221, 142, 221, 136, 208, 152, 116, 136, 184, 27…
## $ insulin <dbl> 55, 105, 143, 119, 98, 53, 142, 68, 76, 60, 47, 91, 74, 158, 1…
## $ group   <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…

The data

diabetes$group <- factor(diabetes$group)
diabetes
## # A tibble: 144 × 7
##       id relwt glufast glutest steady insulin group
##    <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl> <fct>
##  1     1  0.81      80     356    124      55 3    
##  2     3  0.94     105     319    143     105 3    
##  3     5  1         90     323    240     143 3    
##  4     7  0.91     100     350    221     119 3    
##  5     9  0.99      97     379    142      98 3    
##  6    11  0.9       91     353    221      53 3    
##  7    13  0.96      78     290    136     142 3    
##  8    15  0.74      86     312    208      68 3    
##  9    17  1.1       90     364    152      76 3    
## 10    19  0.83      85     296    116      60 3    
## # ℹ 134 more rows

Some variables can predict the group of a patient:

ggplot(reshape2::melt(diabetes, id.vars = c("id", "group")),
       aes(x = value, col = group)) +
  geom_density() + facet_wrap( ~variable, ncol = 1, scales = "free") +
  theme(legend.position = "bottom")

Possible classification rules?

ggplot(diabetes, mapping = aes(x = insulin, y = glutest)) +
  theme_bw() +
  geom_point(aes(colour = group), size = 3) +
  labs( x = "insulin" , y = "glutest") +
  theme(axis.title = element_text( size = 16),
        axis.text  = element_text(size = 12))

Carrying out LDA

Some similarity to regression

diabetes_lda  <-  lda(group ~ insulin + glutest, data = diabetes)
diabetes_lda
## Call:
## lda(group ~ insulin + glutest, data = diabetes)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2222222 0.2500000 0.5277778 
## 
## Group means:
##    insulin   glutest
## 1 320.9375 1027.3750
## 2 208.9722  493.9444
## 3 114.0000  349.9737
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## insulin -0.004463900 -0.01591192
## glutest -0.005784238  0.00480830
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9677 0.0323

Components of diabetes_lda

  • diabetes_lda$prior gives the prior probabilities of belonging to each group. By default these reflect the proportions of membership in the data:
prop.table(table(diabetes$group))
## 
##         1         2         3 
## 0.2222222 0.2500000 0.5277778

\(\rightarrow\) a randomly chosen subject has probability \(\approx 0.53\) of coming from group 3

  • diabetes_lda$means gives the means of each predictor in each group:

  • Proportion of Trace gives the percentage separation achieved by each discriminant function

  • diabetes_lda$scaling contains the linear discriminant functions (i.e., the linear combination of variables giving best separation between groups):

diabetes_lda$scaling
##                  LD1         LD2
## insulin -0.004463900 -0.01591192
## glutest -0.005784238  0.00480830

i.e.,

  • LD1: \(-0.00446 \times \text{insulin} - 0.00578 \times \text{glutest}\)

  • LD2: \(-0.01591 \times \text{insulin} + 0.00481 \times \text{glutest}\)
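As with the PCA “by hand” calculation earlier, the LD scores are just this linear combination of the centred predictors. A sketch with the built-in iris data (a hypothetical stand-in, since it follows the same mechanics), assuming the centring that predict.lda() uses, namely the prior-weighted average of the group means:

```r
library(MASS)
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
X <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
## centre on the prior-weighted average of the group means
ctr <- colSums(fit$prior * fit$means)
by_hand <- scale(X, center = ctr, scale = FALSE) %*% fit$scaling
max(abs(by_hand - predict(fit)$x))  ## effectively zero
```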

How well does LDA do on training data?

ghat <- predict(diabetes_lda)$class
table(predicted = ghat, observed = diabetes$group)
##          observed
## predicted  1  2  3
##         1 25  0  0
##         2  6 24  6
##         3  1 12 70

The misclassification rate is therefore

mean(ghat != diabetes$group)
## [1] 0.1736111

Prediction

diabetes.pred <- predict(diabetes_lda)
str(diabetes.pred)
## List of 3
##  $ class    : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
##  $ posterior: num [1:144, 1:3] 0.00000104 0.00000106 0.00000247 0.00000326 0.00000477 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:144] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:3] "1" "2" "3"
##  $ x        : num [1:144, 1:2] 1.62 1.61 1.42 1.37 1.29 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:144] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "LD1" "LD2"
  • $class: predicted group for each observation
  • $posterior: probability of falling into each group
  • $x: matrix with 2 columns, one for each LD score

Output

  • Every possible point is classified to one of three groups
  • The divisions between groups are linear (Linear Discriminant Analysis)

The three ellipses represent the class centres and the covariance matrix of the LDA model. Note there is only one covariance matrix, which is the same for all three classes. This results in

  • the sizes and orientations of the ellipses being the same for the three classes (only their centres differ)
  • the ellipses represent contours of equal class membership probability.

A key assumption of LDA is that the correlations between variables are the same in each group (i.e., common covariance matrix).

Recall that, by default, the prior probabilities are the initial proportions. What if we set equal prior probabilities?

The confusion matrix/misclassification rate:

equal.ghat <- predict(diabetes_lda, prior = rep(1,3)/3)$class
table(predicted = equal.ghat,observed = diabetes$group)
##          observed
## predicted  1  2  3
##         1 25  0  0
##         2  7 28  9
##         3  0  8 67
## misclassification rate
mean(equal.ghat != diabetes$group)
## [1] 0.1666667

There are 8 cases that were classified as Group 3 under the default (proportional) priors but are classified as Group 2 under equal priors \(\rightarrow\) the default priors bias classification towards the group with the larger initial size.
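The effect of the prior argument can be explored with a small self-contained sketch. Note the simulated data and variable names below are illustrative only (not part of the diabetes example); lda() and predict.lda() come from the MASS package:

```r
## illustrate how the prior argument shifts LDA classification,
## using simulated data with two unequally sized groups
library(MASS)
set.seed(1)
grp <- factor(rep(c("A", "B"), times = c(30, 120)))
x <- c(rnorm(30, mean = 0), rnorm(120, mean = 1.5))
fit <- lda(grp ~ x)
fit$prior  ## defaults reflect the 30/120 split: 0.2 and 0.8
## cross-tabulate classifications under default vs equal priors:
## borderline cases move towards the smaller group under equal priors
table(default = predict(fit)$class,
      equal   = predict(fit, prior = c(0.5, 0.5))$class)
```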

Module 6

R packages and datasets

It is assumed in all following examples of this module that the following code has been executed successfully.

packages used in this module

library(tidyverse) ## general functionality

datasets used in this module

base_url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/"
  1. lobster13
lobster <- read_csv(paste(base_url, "lobster.csv", sep = ""))

Least Squares Estimation

Some basic matrix algebra

This section is a recap only; if you need a more in-depth overview of matrix algebra then use the extra materials provided at the start of this module.

Matrices are commonly written in box brackets or parentheses and are typically denoted by upper case bold letters (e.g., A) with elements represented by the corresponding lower case indexed letters:

\[\mathbf {A} = \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix}\]

The entry in the \(i^{th}\) row and \(j^{th}\) column of a matrix A is often referred to as the \((i,j)^{th}\) entry of the matrix, and most commonly denoted as \(a_{i,j}\), or \(a_{ij}\).

Matrix addition

The sum A + B of two m-by-n matrices A and B: \[\begin{array}{rl} \mathbf {A} + \mathbf {B} &= \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix} + \begin{bmatrix}b_{11}&b_{12}&\cdots &b_{1n}\\b_{21}&b_{22}&\cdots &b_{2n}\\\vdots &\vdots &\ddots &\vdots \\b_{m1}&b_{m2}&\cdots &b_{mn}\end{bmatrix} \\ &= \begin{bmatrix}a_{11} +b_{11}&a_{12}+b_{12}&\cdots &a_{1n}+b_{1n}\\a_{21}+b_{21}&a_{22}+b_{22}&\cdots &a_{2n}+b_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}+b_{m1}&a_{m2}+b_{m2}&\cdots &a_{mn}+b_{mn}\end{bmatrix}\end{array}\]

Scalar multiplication

The product cA of a number c and a matrix A:

\[c\times \mathbf {A} = \begin{bmatrix}c\times a_{11}&c\times a_{12}&\cdots &c\times a_{1n}\\c\times a_{21}&c\times a_{22}&\cdots &c\times a_{2n}\\\vdots &\vdots &\ddots &\vdots \\c\times a_{m1}&c\times a_{m2}&\cdots &c\times a_{mn}\end{bmatrix}\]

Matrix transposition

The transpose of an m-by-n matrix A is the n-by-m matrix A\(^T\):

\[\mathbf {A}^\text{T} = \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix}^\text{T} = \begin{bmatrix}a_{11}&a_{21}&\cdots &a_{m1}\\a_{12}&a_{22}&\cdots &a_{m2}\\\vdots &\vdots &\ddots &\vdots \\a_{1n}&a_{2n}&\cdots &a_{mn}\end{bmatrix}\]

Matrix multiplication

Multiplication of two matrices is defined if and only if (iff) the number of columns of the left matrix is the same as the number of rows of the right matrix. If A is an m-by-n matrix and B is an n-by-p matrix, then their matrix product AB is the m-by-p matrix whose entries are given by the dot product of the corresponding row of A and the corresponding column of B:

\[[\mathbf {AB} ]_{i,j}=a_{i,1}b_{1,j}+a_{i,2}b_{2,j}+\cdots +a_{i,n}b_{n,j}=\sum _{r=1}^{n}a_{i,r}b_{r,j}\]

Note Matrix multiplication satisfies the rules (AB)C = A(BC) (associativity), and (A + B)C = AC + BC as well as C(A + B) = CA + CB (left and right distributivity), whenever the size of the matrices is such that the various products are defined. The product AB may be defined without BA being defined, namely if A and B are m-by-n and n-by-k matrices, respectively, and m \(\neq\) k. Even if both products are defined, they generally need not be equal, that is: AB \(\neq\) BA.
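These operations can all be checked directly in R; a quick sketch (the matrices A and B below are arbitrary examples):

```r
## matrix operations in R: + for addition, * for scalar multiplication,
## t() for the transpose, and %*% for matrix multiplication
A <- matrix(1:6, nrow = 2)                   ## a 2-by-3 matrix
B <- matrix(c(1, 0, 2, 1, 0, 1), nrow = 3)   ## a 3-by-2 matrix
A %*% B       ## defined: a 2-by-2 matrix
B %*% A       ## also defined here (3-by-3), but not equal to A %*% B
t(A)          ## the 3-by-2 transpose of A
2 * A         ## scalar multiplication
```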

Linear least squares

Recall the linear regression (with a simple explanatory variable) equation from Module 2:

\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where

\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]

Here for observation \(i\)

  • \(Y_i\) is the value of the response
  • \(x_i\) is the value of the explanatory variable
  • \(\epsilon_i\) is the error term: the difference between \(Y_i\) and its expected value
  • \(\alpha\) is the intercept term (a parameter to be estimated), and
  • \(\beta_1\) is the slope: coefficient of the explanatory variable (a parameter to be estimated)

Recall, that the Euclidean distance between two points is calculated as

\[\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\]

The method of least squares estimation works by minimising the sum of squared distances between each observation and the line of fit (i.e., the squared residuals). The line of best fit is the line with the smallest residual sum of squares (i.e., of all the possible red lines in the figures)!

Matrix representation of a CRD

Let’s consider the CRD outlined in the previous module, we can write the effects model using matrix representation:

\[\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}\]

where

\(\boldsymbol{y} = \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \\ y_{31} \\ y_{32} \\ y_{33} \\ y_{34} \end{bmatrix}\), \(\boldsymbol{X} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}\), \(\boldsymbol{\beta} = \begin{bmatrix} \alpha \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix}\), and \(\boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \\ \epsilon_{31} \\ \epsilon_{32} \\ \epsilon_{33} \\ \epsilon_{34} \end{bmatrix}.\)

where \(\boldsymbol{\epsilon} \sim \text{MVN}(\boldsymbol{0}, \sigma^2\boldsymbol{I})\). The least squares estimators of \(\boldsymbol{\beta}\) are the solutions to the normal equations \[\boldsymbol{X^{'}X\beta}=\boldsymbol{X^{'}y}.\]

Recall that for a factor variable we take one level (by default the first) as the baseline (i.e., the standard), and the coefficients we estimate for the remaining levels are then comparisons to it (i.e., the differences in means). This ensures that the matrix \(\boldsymbol{X}\) is of full rank. So

\[\boldsymbol{X} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}\]
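In R this full-rank design matrix can be inspected with model.matrix(); a sketch using an illustrative three-level factor with four replicates per level:

```r
## the reduced (full-rank) design matrix for a three-level factor
grp <- factor(rep(paste("group", 1:3), each = 4))
X <- model.matrix(~ grp)
X           ## first column: intercept; others: indicators for groups 2 and 3
qr(X)$rank  ## full rank: 3
```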

Now, with three factor levels the least squares estimators of \(\boldsymbol{\beta}\) are (note the hat to denote the estimates)

\[(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X^{'}y} = \boldsymbol{\hat{\beta}}= \begin{bmatrix} \hat{\alpha} + \hat{\tau}_1 \\ \hat{\tau}_2 - \hat{\tau}_1 \\ \hat{\tau}_3 - \hat{\tau}_1 \end{bmatrix} \]

A numeric example

Using a simple simulated example, let

set.seed(5469)
type <- factor(rep(paste("group", 1:3), each = 4))
df <- data.frame(Group = type, Response = round(c(rnorm(4, 5), rnorm(4, 7), rnorm(4, 9)), 1))
glimpse(df)
## Rows: 12
## Columns: 2
## $ Group    <fct> group 1, group 1, group 1, group 1, group 2, group 2, group 2…
## $ Response <dbl> 5.3, 6.0, 6.6, 4.9, 7.5, 7.1, 6.3, 7.6, 8.8, 6.9, 9.2, 10.3

Using our matrix algebra

We have

\(\boldsymbol{y} = \begin{bmatrix}5.3 \\6 \\6.6 \\4.9 \\7.5 \\7.1 \\6.3 \\7.6 \\8.8 \\6.9 \\9.2 \\10.3 \\\end{bmatrix}\), \(\boldsymbol{X} = \begin{bmatrix}1&0&0 \\1&0&0 \\1&0&0 \\1&0&0 \\1&1&0 \\1&1&0 \\1&1&0 \\1&1&0 \\1&0&1 \\1&0&1 \\1&0&1 \\1&0&1 \\\end{bmatrix}\), and \(\boldsymbol{X^{'}} = \begin{bmatrix}1&1&1&1&1&1&1&1&1&1&1&1 \\0&0&0&0&1&1&1&1&0&0&0&0 \\0&0&0&0&0&0&0&0&1&1&1&1 \\\end{bmatrix}\)

so

\(\boldsymbol{X^{'}X} = \begin{bmatrix}12&4&4 \\4&4&0 \\4&0&4 \\\end{bmatrix}\), \(\boldsymbol{X^{'}y} = \begin{bmatrix}86.5 \\28.5 \\35.2 \\\end{bmatrix}\), and \((\boldsymbol{X^{'}X})^{-1} = \begin{bmatrix}0.25&-0.25&-0.25 \\-0.25&0.5&0.25 \\-0.25&0.25&0.5 \\\end{bmatrix}.\)

Therefore,

\[\boldsymbol{\hat{\beta}} = (\boldsymbol{X^{'}X})^{-1}\boldsymbol{X^{'}y} = \begin{bmatrix} \hat{\alpha} + \hat{\tau}_1 \\ \hat{\tau}_2 - \hat{\tau}_1 \\ \hat{\tau}_3 - \hat{\tau}_1 \end{bmatrix} = \begin{bmatrix}0.25&-0.25&-0.25 \\-0.25&0.5&0.25 \\-0.25&0.25&0.5 \\\end{bmatrix} \times \begin{bmatrix}86.5 \\28.5 \\35.2 \\\end{bmatrix} = \begin{bmatrix}5.7 \\1.425 \\3.1 \\\end{bmatrix}\]
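The hand calculation above can be reproduced with R's matrix operators (the response values are those printed by glimpse(df) earlier):

```r
## least squares estimates via (X'X)^{-1} X'y
y <- c(5.3, 6.0, 6.6, 4.9, 7.5, 7.1, 6.3, 7.6, 8.8, 6.9, 9.2, 10.3)
X <- cbind(1,
           rep(c(0, 1, 0), each = 4),   ## indicator for group 2
           rep(c(0, 0, 1), each = 4))   ## indicator for group 3
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
drop(beta_hat)   ## 5.700 1.425 3.100, matching the matrix algebra above
```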

Using lm in R

Linear least squares estimation can be carried out in R by simply using the function lm():

mod <- lm(Response ~ Group, data = df)
summary(mod)$coefficients
##              Estimate Std. Error   t value       Pr(>|t|)
## (Intercept)     5.700  0.4934994 11.550166 0.000001065454
## Groupgroup 2    1.425  0.6979136  2.041800 0.071556522658
## Groupgroup 3    3.100  0.6979136  4.441811 0.001619249110

TASK Using a (manageable) dataset, carry out the matrix operations above and compare your answers to the output from lm().

Maximum likelihood estimation

Under the assumptions of a linear model, maximum likelihood estimation is equivalent to least squares. However (as we’ll see in a later module) we often need to be more flexible!

Differentiation rules

This section is a recap only; if you need a more in-depth overview of differentiation then use the extra materials provided at the start of this module.

The constant factor rule

\[(af(x))'=af'(x)\] The sum rule

\[(f(x)+g(x))'=f'(x)+g'(x)\] The subtraction rule

\[(f(x) - g(x))'=f'(x) - g'(x)\]

The product rule

For the functions \(f\) and \(g\), the derivative of the function \(h(x) = f(x)g(x)\) with respect to \(x\) is

\[h'(x) = (fg)'(x) = f'(x)g(x) + f(x)g'(x)\]

The chain rule

The derivative of the function \(h(x)=f(g(x))\) is

\[h'(x)=f'(g(x))\cdot g'(x)\]

In summary, for any functions \(f\) and \(g\) and any real numbers \(a\) and \(b\), the derivative of the function \(h(x) = a f(x)+bg(x)\) with respect to \(x\) is \(h'(x)=af'(x) + bg'(x).\)
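These rules can be sanity-checked numerically; a sketch using a central finite-difference approximation (the functions f and g below are arbitrary examples):

```r
## check the product rule h'(x) = f'(x)g(x) + f(x)g'(x) numerically
f <- function(x) x^2
g <- function(x) sin(x)
h <- function(x) f(x) * g(x)
## central finite-difference approximation to a derivative
num_deriv <- function(fn, x, eps = 1e-6) (fn(x + eps) - fn(x - eps)) / (2 * eps)
x0 <- 1.3
num_deriv(h, x0)                       ## numerical derivative of h at x0
2 * x0 * sin(x0) + x0^2 * cos(x0)      ## product rule: the two agree
```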

Logarithm rules

This section is a recap only; if you need a more in-depth overview of log rules then use the extra materials provided at the start of this module.

  • Product rule: \(\text{log}(xy) = \text{log}(x) + \text{log}(y)\)
  • Quotient rule: \(\text{log}(x/y)=\text{log}(x)−\text{log}(y)\)
  • Log of power: \(\text{log}(x^y)=y\text{log}(x)\)
  • Log of e: \(\text{log}(e)=1\)
  • Log of one: \(\text{log}(1)=0\)
  • Log reciprocal: \(\text{log}(1/x)=−\text{log}(x)\)
  • Differentiating a log: \(\frac{\delta \text{log}(y)}{\delta x} = \frac{1}{y}\frac{\delta y}{\delta x}.\)
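Each rule is easy to verify numerically in R (x and y below are arbitrary positive numbers; log() is the natural log):

```r
## numerical checks of the log rules
x <- 3; y <- 7
all.equal(log(x * y), log(x) + log(y))  ## product rule
all.equal(log(x / y), log(x) - log(y))  ## quotient rule
all.equal(log(x^y), y * log(x))         ## log of a power
log(exp(1))                             ## log of e = 1
log(1)                                  ## log of one = 0
```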

Maximum likelihood estimation for a Binomial distribution

The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.

Let us again consider the lobster data from the published article Influence of predator identity on the strength of predator avoidance responses in lobsters.

Recall that the authors were interested in how a juvenile lobster’s size was related to its vulnerability to predation. In total, 159 juvenile lobsters were collected from their natural habitat in the Gulf of Maine, USA, and the length of each lobster’s carapace (upper shell) was measured to the nearest 3 mm. The lobsters were then tethered to the ocean floor for 24 hours. Any missing lobsters were assumed to have been consumed by a predator, while the surviving lobsters were released.

lobsters <- lobster %>%
  mutate(survived = ifelse(survived == 0, "consumed", "alive")) %>%
  group_by(survived) %>%
  tally() %>%
  pivot_wider(names_from = c(survived), values_from = n)

So, 79 of the 159 juvenile lobsters survived predation. We are interested in the probability of survival, \(\theta\), for the general population. The obvious estimate is simply to take the ratio, \(\frac{\text{number surviving juveniles}}{\text{total number juveniles}} = \frac{79}{159} = 0.4968553\)!!

There are, however, many situations where our common sense fails us. For example, what would we do if we had a regression-model situation as in Module 2 and would like to specify an alternative form for \(\theta\), such as

\[\theta = \alpha + \beta \times (\text{lobster species}).\]

How would we estimate the unknown intercept \(\alpha\) and slope(s) \(\beta\), assuming we had information on lobster species etc.? For this we need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation. I mean the title of this section was a bit of a giveaway!

The Likelihood

The likelihood function, \(L(\theta ; x)\), is a function of the parameter(s) \(\theta\) for fixed data \(x\): it gives the probability, \(P(X = x)\), of the fixed observation \(x\) for every possible value of the parameter(s) \(\theta\).

From above, letting \(S\) be the number of lobsters alive after the 24 hours we can assume a Binomial distribution:

\[L(\theta ; s) = P(S = s) = {n \choose s} \theta^s (1 - \theta)^{n-s}.\]

To obtain the maximum likelihood estimate (i.e., the best guess for \(\theta\) given the observed data) we need to differentiate the likelihood. That is, find its rate of change given data \(x\), \(\frac{\delta L}{\delta \theta}\). We are interested in the best guess for \(\theta\); this occurs at the point where the rate of change of our likelihood is zero (i.e., \(\frac{\delta L}{\delta \theta} = 0\)).

Using the product rule we differentiate \(L(\theta ; s)\):

\[ \begin{array}{cl} \frac{\delta L}{\delta \theta} &= {n \choose s}\left( s\theta^{s-1} (1 - \theta)^{n-s} + \theta^s (n-s) (1 - \theta)^{n-s-1}(-1) \right)\\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s(1-\theta) - (n-s)\theta\} \\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - s\theta - n\theta + s\theta \}\\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - n\theta \}. \end{array}\]

Setting \(\frac{\delta L}{\delta \theta} = 0\) and solving for \(\theta\):

\[\frac{\delta L}{\delta \theta} = {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - n\theta \} = 0.\] Since \({n \choose s} > 0\), there are, technically, three possible solutions to this:

  1. when \(\theta^{s-1} = 0 \rightarrow \theta = 0\),
  2. when \(s - n\theta = 0 \rightarrow\theta = \frac{s}{n}\), or
  3. when \((1 - \theta)^{n-s-1} = 0 \rightarrow\theta = 1\).

Now, remember \(\theta\) is a probability and we are after a maximum likelihood estimate based on the data, so based on the above our best guess for \(\theta\) is \[\hat{\theta} = \frac{s}{n}.\] Using this for the lobster data we get \(\hat{\theta} = \frac{79}{159}.\) Surprise surprise!!

Using R

## define the likelihood as a function of the parameter(s)
## Luckily the Binomial likelihood is already defined in R
## as dbinom()
likelihood <- function(theta) dbinom(x = 79, size = 159, prob = theta)
## Use the optimise function to optimise!!
## the second argument specifies the plausible
## interval for the parameter
## Note for a number of parameters > 1
## the optim() function is used
optimise(likelihood, c(0,1), maximum = TRUE)
## $maximum
## [1] 0.4968541
## 
## $objective
## [1] 0.06317819

Maximum likelihood estimation for a CRD

Recall the following CRD equation

\[Y_{ik} = \mu_i + \epsilon_{ik}\]

where \(Y_{ik}\) is the response for the \(k^{th}\) experimental unit (\(k = 1, ..., r_i\), where \(r_i\) is the number of experimental replications in the \(i^{th}\) level of the treatment factor) subjected to the \(i^{th}\) level of the treatment factor (\(i = 1, ..., t\)). Here \(\mu_i\) are the different (cell) means for each level of the treatment factor.

Under the assumptions of the CRD (i.e., \(\epsilon_{ik} \sim N(0, \sigma^2)\)) then (for equal numbers of replicates) the estimates of the cell means (\(\mu_i\)) are found by minimising the error sum of squares \[SS_{\epsilon} = \Sigma_{i=1}^t \Sigma_{k=1}^{r_i}(y_{ik}-\mu_i)^2.\] Taking the partial derivative of \(SS_{\epsilon}\) with respect to each cell mean, setting it to zero, and solving each equation will give us our estimates: \[\frac{\delta SS_{\epsilon}}{\delta \mu_i} = -2 \Sigma_{k=1}^{r_i}(y_{ik}-\mu_i) = 0.\] This works out as \(\hat{\mu}_i = \overline{y}_{i.}\), the sample mean of the \(i^{th}\) treatment group.

So in our simulated example above \(\hat{\mu}_{\text{group 1}} = 5.7\), \(\hat{\mu}_{\text{group 2}} = 7.125\), and \(\hat{\mu}_{\text{group 3}} = 8.8\). Compare these estimates to those we obtained via least squares estimation previously.

Maximising the log-likelihood

In many situations differentiating the likelihood is tricky. Even the simple binomial example above was a bit finicky. So, we often deal with the log-likelihood (the log of the likelihood). Why?

  1. The logarithmic function \(L \mapsto \text{log}(L)\) is increasing, so the functions \(L(\theta)\) and \(\text{log}(L(\theta))\) will have the same maximum, \(\hat{\theta}\).
  2. When there are observations \(x_1, \ldots, x_n\), the likelihood \(L\) is a product. As \(\text{log}(a b) = \text{log}(a) + \text{log}(b)\), the log-likelihood converts the product into a sum. It is often easier to differentiate a sum than a product, so the log-likelihood is easier to maximise while still giving the same MLE.
  3. If we need to use a computer to calculate and maximize the likelihood, there will often be numerical problems with computing the likelihood product, whereas the log-likelihood sum can be accurately calculated.

Let’s consider the binomial example above. We had \(L(\theta ; s) = {n \choose s} \theta^s (1 - \theta)^{n-s}.\)

Therefore, \[\begin{array}{cl} \text{log}(L(\theta ; s)) &= \text{log}({n \choose s}) + \text{log}(\theta^s) + \text{log}((1 - \theta)^{n-s})\\ &= \text{log}({n \choose s}) + s\text{log}(\theta) + (n-s)\text{log}(1 - \theta). \end{array}\]

Differentiating this: \[\begin{array}{cl} \frac{\delta \text{log}(L(\theta ; s))}{\delta \theta} &= 0 + \frac{s}{\theta} \times 1 + \frac{n-s}{1-\theta}\times (-1)\\ &= \frac{s}{\theta} - \frac{n-s}{1-\theta} \\ \end{array}\]

Setting this to zero we get \(\frac{s}{\theta} = \frac{n-s}{1-\theta} \rightarrow s(1-\theta) = \theta(n-s) \rightarrow s - s\theta = \theta n - s\theta \rightarrow s + (s\theta - s\theta) = \theta n \rightarrow \theta = \frac{s}{n}.\) Therefore, as above \[\hat{\theta} = \frac{s}{n}.\]
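The same maximum can be found numerically from the log-likelihood; dbinom() with log = TRUE gives it directly (s = 79 and n = 159 from the lobster data):

```r
## maximise the log-likelihood: same MLE, but numerically better behaved
loglik <- function(theta) dbinom(x = 79, size = 159, prob = theta, log = TRUE)
optimise(loglik, c(0, 1), maximum = TRUE)$maximum
## compare with the closed-form MLE s/n
79 / 159
```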

Maximum likelihood estimation for a Poisson distribution

The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently and at a constant average rate, \(\lambda\).

For \(X \sim \text{Poisson}(\lambda)\),

\[f_X(x) = P(X=x)=\frac{\lambda^x}{x!}e^{-\lambda}\]

for \(x = 0,1,2,\dots\)

Maximising the Likelihood

Suppose that \(x_1, \ldots, x_n\) are iid observations from a Poisson distribution with unknown parameter \(\lambda\):

\[ L(\lambda\,; x_1, \ldots, x_n) = K e^{-n\lambda} \, \lambda^{n \overline{x}} \,,\]

where \(\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i\), and \(K=\prod_{i=1}^n \frac1{x_i\,!}\) is a constant that doesn’t depend on \(\lambda\).

We differentiate \(L(\lambda\,; x_1, \ldots, x_n)\) and set to 0 to find the MLE:

\[\begin{array}{rl} 0 &= \frac{\delta}{\delta\lambda} L(\lambda\,; x_1, \ldots, x_n) \\ &= K \left( -n e^{-n\lambda} \, \lambda^{n \overline{x}} + n\overline{x} e^{-n\lambda} \, \lambda^{n \overline{x} - 1} \right)\\ &= K e^{-n\lambda} \lambda^{n \overline{x} - 1} \left(- n\lambda + n\overline{x} \right) \end{array}\]

\(\rightarrow \lambda=\infty, \lambda=0, \text{or } \lambda = \overline{x}.\)

If we know that \(L(\lambda\,; x_1, \ldots, x_n)\) reaches a unique maximum in \(0 < \lambda < \infty\) then the MLE is \(\overline{x}\).

So the maximum likelihood estimator is \[ \hat{\lambda} = \overline{x} = \frac{x_1 + \ldots + x_n}{n} \;.\]

Maximising the log-likelihood function

As above,

\[L(\lambda\,; x_1, \ldots, x_n) = \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i\,!} e^{-\lambda}.\] Therefore,

\[\begin{array}{rl}\text{log}(L(\lambda\,; x_1, \ldots, x_n)) &= \sum_{i=1}^n \text{log}(\frac{\lambda^{x_i}}{x_i\,!} e^{-\lambda})\\ &= \sum_{i=1}^n \text{log}(\frac{1}{x_i\,!}) + \text{log}(\lambda^{x_i}) + \text{log}(e^{-\lambda}) \\ &= \sum_{i=1}^n \text{log}(\frac{1}{x_i\,!}) + x_i \text{log}(\lambda) + (-\lambda) \\ &= K' + \text{log}(\lambda) \sum_{i=1}^n x_i - n \lambda \quad \mbox{ where $K'$ is a constant} \\ &= K' + \text{log}(\lambda) n\overline{x} - n \lambda. \end{array}\]

Differentiate and set to 0 for the MLE:

\[\begin{array}{rcl} 0 &=& \frac{\delta}{\delta\lambda} \text{log} (L(\lambda\,; x_1, \ldots, x_n) ) \\ &=& \frac{\delta}{\delta\lambda} \left (K' + \text{log}(\lambda) n\overline{x} - n \lambda \right )\\ &=& \frac{n\overline{x}}{\lambda} - n \\ \end{array}\] assuming a unique maximum in \(0 < \lambda < \infty\) the MLE is \(\hat{\lambda} = \overline{x}\) as before.
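A quick numerical check of this result, using simulated Poisson data (the seed and rate below are arbitrary choices):

```r
## the Poisson MLE equals the sample mean
set.seed(738)
x <- rpois(50, lambda = 4)
## log-likelihood as a sum over observations
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimise(loglik, c(0.01, 20), maximum = TRUE)$maximum
mean(x)  ## the two agree (up to the optimiser's tolerance)
```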

Maximum likelihood estimation for a continuous random variable

Let \(X\) be a continuous random variable with probability density function \[f_X(x) = \left\{ \begin{array}{cl} \frac{2(s-x)}{s^2} & \mbox{for } 0 < x < s\,, \\ 0 & \mbox{otherwise}\,. \end{array} \right.\]

Here, \(s\) is a parameter to be estimated, where \(s\) is the maximum value of \(X\) and \(s>0\).

Assuming a single observation \(X=x\) the likelihood function is

\[ L(s\,;x) = \frac{2(s-x)}{s^2}\] for \(\;x < s < \infty.\)

Differentiating this

\[\begin{array}{rl} \frac{dL}{ds} &= 2 \left(-2 s^{-3}(s-x) + s^{-2}\right ) \\ &= 2s^{-3} (-2(s-x) + s) \\ &= \frac{2}{s^3} (2x-s). \end{array}\]

At the MLE, \[ \frac{\delta L}{\delta s} = 0 \implies s=\infty \quad\mbox{ or }\quad s = 2x.\]

Realistically \(s=\infty\) is not the maximum (see graph below), so \(s = 2x.\) Therefore the maximum likelihood estimator is \[\hat{s} = 2X.\]

Using R to get the MLE

Let’s say we observe \(X = 3\), then to find the MLE using R we use

## define the likelihood as a function of the parameter(s)
likelihood <- function(s) (2*(s - 3))/s^2
## Use the optimise function to optimise
## the second argument specifies the plausible
## interval for the parameter
## Note for a number of parameters > 1
## the optim() function is used
optimise(likelihood, c(1,8), maximum = TRUE)
## $maximum
## [1] 6.000018
## 
## $objective
## [1] 0.1666667

How does this compare to the exact estimator, \(\hat{s} = 2X\), we found using calculus above? Consider, too, the plot below showing \(L(s; X = 3)\).

Introduction to Bayesian statistics

“Critical thinking is an active and ongoing process. It requires that we all think like Bayesians, updating our knowledge as new information comes in.”
— Daniel J. Levitin, A Field Guide to Lies: Critical Thinking in the Information Age

Conditional probability

The probability of the event \(A\) occurring given that the event \(B\) has already occurred is \[P(A∣B) = \frac{P(A \:\text{and}\: B)}{P(B)}.\] This is called a conditional probability. Note that \(P(A∣B)\) is not the same as \(P(B∣A)\).

An example

Rapid antigen (lateral flow) tests are used to detect SARS-CoV-2 infection (COVID-19). They are very easy to use and a lot less uncomfortable than having a swab taken for a PCR test!

A rapid antigen test of mine, taken Jan 2022

In summary, the lateral flow test can show a positive (\(+\)) or a negative (\(-\)) result. The person taking the test either has (infected) or does not have COVID-19 (not infected).

It is reported that the average sensitivity of the Innova lateral flow tests is \(\sim 0.787\). Breaking this down means that given you have SARS-CoV-2 (COVID-19), the chance of a positive lateral flow test is 0.787.

It was also reported that the specificity of this test was 0.997. That is, the chance of a negative test given that you do not have COVID-19 is 0.997.

This can be summarised as \(P( + | \text{infected}) = 0.787\) and \(P( -| \text{not infected}) = 0.997\)

What you would probably like to know is given that the test is positive, what is the probability that you have COVID-19? \(P( \text{infected}| +) = ?\)

Let’s assume that the proportion of people in the population with COVID-19 is 10% (not far off the estimated % with Omicron in London a few weeks ago); that is, \(P(\text{infected}) = 0.1\).

But, what about \(P( \text{infected}| +) ?\)

Recall, \[P(A∣B) = \frac{P(A \:\text{and}\: B)}{P(B)}.\] So,

\[P( \text{infected}| +) = \frac{P(\text{infected} \:\text{and}\: +)}{P(+)}.\]

We have, \(P(\text{infected} \:\text{and}\: +) = P(\text{infected})\times P( + | \text{infected}) = 0.1 \times 0.787 = 0.0787.\)

So, \(P(+) = P(\text{infected} \:\text{and}\: +) + P(\text{not infected} \:\text{and}\: +) =\) \(0.0787 + ( 0.9 \times (1 - 0.997)) = 0.0787 + ( 0.9 \times 0.003) =\) \(0.0787 + 0.0027 = 0.0814.\)

Therefore, \(P( \text{infected}| +) = \frac{ 0.0787}{0.0814} = 0.9668305.\)

Rearranging,

\[P( \text{infected}| +) = \frac{P( + | \text{infected})P(\text{infected})}{P(+)}.\]
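The calculation above is easily reproduced in R (the sensitivity, specificity, and prevalence values are those stated in the text):

```r
## P(infected | +) via Bayes' rule
sens <- 0.787  ## P(+ | infected)
spec <- 0.997  ## P(- | not infected)
prev <- 0.1    ## P(infected)
p_pos <- prev * sens + (1 - prev) * (1 - spec)  ## P(+), by total probability
p_pos                   ## 0.0814
prev * sens / p_pos     ## P(infected | +) = 0.9668...
```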

Bayes’ rule

Bayes’ theorem, named after British mathematician Reverend Thomas Bayes, is a mathematical formula for determining conditional probability. His work and theorems were presented in An Essay towards solving a Problem in the Doctrine of Chances, which was read to the Royal Society in 1763, after his death.

The Reverend Thomas Bayes

Theorem, Bayes’ rule The conditional probability of the event \(B\) conditional on the event \(A\) is given by

\[P(B|A) = \frac{P(A|B)P(B)}{P(A)}\]

Let’s think of this in terms of our data and hypothesis:

\[P(\text{hypothesis}∣\text{data}) = \frac{P(\text{data} | \text{hypothesis} )P(\text{hypothesis})}{P(\text{data})}\]

Recall from the previous sections that our hypotheses relate to estimating parameter values (e.g., intercepts, differences in means, slopes etc ). The formula above (Bayes’ theorem) calculates the probability of the parameter(s), say \(\theta\), values given the data.

But what is the difference here to maximum likelihood estimation (i.e., MLE, the frequentist approach)? Taking an MLE approach assumes that the parameters are fixed (i.e., they have one true value); the parameters are unknown and are to be estimated. Using this approach we typically obtain a point estimate of the parameter of interest. Taking a Bayesian approach assumes that the parameters are not fixed: instead, parameters are assumed to come from some fixed unknown distribution (i.e., a range of plausible values). This approach assumes that we have some prior beliefs about the parameters (even if these beliefs are uninformative). This information is introduced a priori into the modelling framework.

\[P(B|A) = \frac{P(A|B)P(B)}{P(A)}\]

Let \(A = \theta\) and \(B = \text{data}\), then the above translates to

\[P(\theta∣\text{data}) = \frac{P(\text{data} | \theta )P(\theta)}{P(\text{data})}\]

where \(P(\theta∣\text{data})\) represents what you know after having seen the data. This is called the posterior distribution and is the basis for inference; it is a distribution, possibly multivariate if there is more than one parameter (\(\theta\)). \(P(\text{data} | \theta )\) is the likelihood (think back to the previous section). \(P(\theta)\) is called the prior distribution and represents what you know before seeing the data; it is the source of much discussion about the Bayesian approach. Finally, the denominator

\[P(\text{data}) = \int P(\text{data}|\theta)P(\theta)d\theta\] is typically a high-dimensional integral, difficult if not impossible to calculate.

An example using the lobster data

We define large juveniles as those with carapace \(\geq\) 40 mm; otherwise we class them as small.

lobsters <- lobster %>%
  mutate(size = ifelse(size >= 40, "large", "small")) %>%
  mutate(survived = ifelse(survived == 0, "consumed", "alive")) %>%
  group_by(size, survived) %>%
  tally() %>%
  pivot_wider(names_from = c(survived), values_from = n)
size  alive  consumed
large    56        23
small    23        57

So, as before, 23 of the small juvenile lobsters survived predation from a total of 80. We are interested in the probability of survival, \(\theta\), for the general population of small lobsters. The obvious estimate is simply to take the ratio, \(\frac{23}{80} = 0.2875\). But, what are the implied stats behind our common sense estimate?

Let \(S\) be the number alive after the 24 hours, then we can assume a Binomial distribution:

\[P(S = s) = {n \choose s} \theta^s (1 - \theta)^{n-s}\]

A frequentist approach would be to maximise the likelihood with respect to \(\theta\), see the previous section. This would result in the maximum likelihood estimate (MLE) of \(\hat{\theta} = \frac{23}{80} = 0.2875\)

Using a Bayesian approach we first need to start off with a prior distribution. This should reflect our prior beliefs about the parameter(s) of interest. We know \(\theta\) is a probability, so it is a continuous random variable and that lies between zero and one.

A suitable prior distribution might be the Beta, defined on the interval [0, 1]. Therefore, we assume a priori that \(\theta \sim \text{Beta(a, b)}\) so that \(P(\theta) \propto \theta^{a−1}(1 − \theta)^{b−1}\). See the plot below for the range of shapes a Beta distribution takes with different parameter values.
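The range of Beta shapes can be sketched with dbeta() and curve() (the parameter pairs below are arbitrary examples):

```r
## a few Beta density shapes on [0, 1]
curve(dbeta(x, 1, 1), from = 0, to = 1, ylim = c(0, 3),
      ylab = "density", xlab = expression(theta))  ## uniform: Beta(1, 1)
curve(dbeta(x, 2, 5), add = TRUE, lty = 2)         ## mass towards 0
curve(dbeta(x, 5, 2), add = TRUE, lty = 3)         ## mass towards 1
```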

Recall from above that

\[P(\theta∣\text{data}) = \frac{P(\text{data} | \theta )P(\theta)}{P(\text{data})},\] which can be written as

\[P(\theta∣\text{data}) \propto P(\text{data} | \theta )P(\theta).\] The \(\propto\) means proportional to; basically this means we can ignore any terms not containing the parameters. Therefore,

\[\begin{array}{rl} P(\theta∣s) & \propto {n \choose s} \theta^s (1 - \theta)^{n-s}\theta^{a−1}(1 − \theta)^{b-1} \\ & \propto \theta^{(a+s)−1}(1 − \theta)^{(b+n−s)−1} \end{array}.\]

So the posterior distribution for survival is \(\theta | s \sim \text{Beta}(a + s, b + n - s).\)

We’re going to choose an uninformative prior (i.e., \(\text{Beta}(1, 1)\) above). So \(\theta_{\text{small}} | s \sim \text{Beta}(1 + 23, 1 + 80 - 23) = \text{Beta}(24, 58).\) We want the expected value of this, which has an explicit form:

\[\mathbb{E}[\text{Beta}(a, b)] = \frac{a}{a + b} = \frac{24}{82}.\]
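Putting the numbers in (a Beta(1, 1) prior with s = 23 survivors out of n = 80):

```r
## posterior mean for small-lobster survival under a Beta(1, 1) prior
a <- 1; b <- 1      ## uninformative prior
s <- 23; n <- 80    ## survivors and total
post_a <- a + s     ## 24
post_b <- b + n - s ## 58
post_a / (post_a + post_b)  ## posterior mean 24/82
s / n                       ## MLE 0.2875, for comparison
```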

How does this compare to our MLE estimate from above?

Typically, Bayesian and frequentist estimates will always agree if there is sufficient data, so long as the likelihood is not explicitly ruled out by the prior.
— Olivier Gimenez, Bayesian statistics with R

Prior sensitivity

When choosing a prior distribution you should focus on what that prior means in the context of the research problem. Prior choice will influence the posterior distribution. Uninformative priors can be chosen if we wish to rely only on the likelihood (i.e., let the data speak for itself), which is itself subjective based on how/where data were collected. However, uninformative priors are in general unrealistic as equal weight is given to all values!

Prior choice and sensitivity is beyond the scope of this course; however, I would strongly encourage you to read the linked materials at the start of this module.

What would happen to our posterior distribution, above, if we were to choose a different prior?


  1. some might say↩︎

  2. Ross Ihaka has been known to express his preference for = simply due to requiring less typing↩︎

  3. although since v4.2.0 the base pipe does now allow for a named placeholder _↩︎

  4. Zhao Cao, Weiran Jin, and Jack Blackwood from the 2023 cohort↩︎

  5. https://researchscienceinnovation.nz/case-studies/relentless-science-communication-in-the-time-of-covid/index.html↩︎

  6. A realm based on that in the Fourth Wing.↩︎

  7. Note if a factorial design has equal numbers of replicates in each group then it is said to be a balanced design; if this is not the case then it is unbalanced.↩︎

  8. Source: Partitioning beta diversity to untangle mechanisms underlying the assembly of bird communities in Mediterranean olive groves↩︎

  9. Source: Influence of predator identity on the strength of predator avoidance responses in lobsters.↩︎

  10. Source: Human Gut Microbiota from Autism Spectrum Disorder Promote Behavioral Symptoms in Mice↩︎

  11. Source: Dimensions of color vision↩︎

  12. Source: Graphical display of two-way contingency tables and Graphical methods for categorical data↩︎

  13. Source: Influence of predator identity on the strength of predator avoidance responses in lobsters.↩︎